Welcome

About this handbook

Objective

A free, open-access digital R reference book for epidemiologists and public health practitioners. It is usable offline and addresses common epidemiological tasks with clear text explanations, step-by-step instructions, and best-practice R code examples.

Epidemiologists using R must often search Google and read dozens of forum pages to complete common data manipulation and visualization tasks. However, field epidemiologists often work in low internet-connectivity environments and have limited technical support. This handbook aims to fill that gap.

How to read this handbook:

  • This handbook is an HTML file which can be viewed offline, and is best viewed with Google Chrome.

  • Search via the search box above the Table of Contents. Ctrl+f will search across the current page.

  • Click the “clipboard” icon in the upper-right of each code chunk to copy it.

Version
The latest version of this handbook can be found at this GitHub repository.

Acknowledgements

Contributors

Editor-in-Chief: Neale Batra ()

Editorial core team:

Authors:

Reviewers:

Advisers

Funding and programmatic support

TEPHINET
EAN

Data sources

outbreaks R package

Inspiration and templates

R4Epis
RECON packages
R4DS book (Hadley)
Bookdown book (Yihui)
Rmarkdown book (Yihui)

Image credits

Logo: CDC Public Image gallery; R Graph Gallery

I About this book

Style and editorial notes

Style

Text style

Package and function names

Package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). Where a function's package is referenced explicitly, whether in text or in code, it is written like this: dplyr::mutate()

Types of notes

NOTE: This is a note

TIP: This is a tip.

CAUTION: This is a cautionary note.

DANGER: This is a warning.

tidyverse

This handbook generally uses tidyverse R coding style. Read more here

Code readability

We often chose to spread code across several lines so that each part can carry a clearer comment. As a result, code that could be written like this:

obs %>% 
  group_by(name) %>%                    # group the rows by 'name'
  slice_max(date, n = 1, with_ties = F) # if there's a tie (of date), take the first row

…is often written like this:

obs %>% 
  group_by(name) %>%   # group the rows by 'name'
  slice_max(
    date,              # keep row per group with maximum date value 
    n = 1,             # keep only the single highest row 
    with_ties = F)     # if there's a tie (of date), take the first row

Editorial decisions

Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool, please join or start a conversation on our GitHub page.

Table of package, function, and other editorial decisions

Subject | Considered | Outcome & date | Brief rationale
Epiweeks | aweek, lubridate | lubridate, Dec 2020 | consistency, package maintenance prospects

Datasets used

The datasets used in this handbook will be described here and made available for download:

  • Linelist (…)
  • Aggregated case counts (…)
  • GIS shapefile (…)
  • modeling dataset? (…)

II Basics

R Basics

Overview

This section is not meant as a comprehensive “learn basic R” tutorial. However, it does cover some of the fundamentals that can be useful for reference or for refreshing your memory.

See the tab on recommended training for more comprehensive tutorials.

Why use R?

As stated on the R project website, R is a programming language and environment for statistical computing and graphics. It is highly versatile, extensible, and community-driven.

Cost

R is free to use! There is a strong ethic in the community of free and open-source material.

Reproducibility

Conducting your data management and analysis through a programming language (compared to Excel or other primarily manual tools) enhances reproducibility, makes error-detection easier, and eases your workload.

Community

The broad R community is enormous and collaborative. New packages and tools are developed daily, and vetted by the community. Perhaps the largest organization of R users is R-Ladies, which likely has a chapter near you.

Packages

An R package is a shareable bundle of code and documentation that contains pre-defined functions. Users in the R community develop and share packages all the time, so chances are likely that a solution exists for you! You will install and use hundreds of packages in your use of R.

CRAN

CRAN (Comprehensive R Archive Network) is a public warehouse of R packages that have been published by R community members. Most often, R users download packages from CRAN.

Install vs. Load

To use a package, two steps are required:

  1. The package must be installed (once), and
  2. The package must be loaded (each R session)

The basic function for installing a package is install.packages(), where the name of the package is provided in quotes. This can also be accomplished point-and-click by going to the RStudio “Packages” pane and clicking “Install”.

install.packages("tidyverse")

The basic function to load a package for use (after it has been installed) is library(), with the name of the package NOT in quotes.

library(tidyverse)

Using pacman

This handbook uses the package pacman (abbreviation for “package manager”), which offers the useful function p_load(). This function combines the above two steps into one - it installs and/or loads packages, depending on what is needed. If the package has not yet been installed, it will attempt to install from CRAN, and then load it.

Below, we load some of the packages used in this R basics page:

pacman::p_load(tidyverse, rio, here)

The function p_isinstalled() will test whether packages are installed already.
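For example, a quick check before installing (the package name here is just an illustration):

```r
# returns TRUE if the package is already installed, FALSE otherwise
pacman::p_isinstalled("tidyverse")
```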

Install from github

Sometimes, you need to install the development version of a package from a GitHub repository. You can use p_install_gh() and p_load_gh() from pacman (these are wrappers around install_github() from the devtools/remotes packages).

# install development version of package from github repository
p_install_gh("reconhub/epicontacts")

# install (if needed) and load the development version of the package from its GitHub repository
p_load_gh("reconhub/epicontacts")

Read more about pacman here

Install from ZIP or TAR

You could get the package from a URL:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

Or download it to your computer in a zipped file:

Option 1:

library(devtools)
install_local("~/Downloads/dplyr-master.zip")

Option 2:

install.packages(path_to_source, repos = NULL, type="source")

install.packages("~/Downloads/dplyr-master.zip", repos=NULL, type="source")

Delete packages

Use p_delete() from pacman, or remove.packages() from base R. Alternatively, go find the folder which contains your library and manually delete the folder.
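For instance, assuming the dsr package mentioned above is installed, either of these commands would remove it:

```r
# remove an installed package (package name in quotes)
remove.packages("dsr")     # base R
pacman::p_delete("dsr")    # pacman equivalent (also works on multiple packages)
```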

Dependencies

Packages often depend on other packages to work. These are called dependencies. If a dependency fails to install, then the package depending on it may also fail to install.

See the dependencies of a package with p_depends(), and see which packages depend on it with p_depends_reverse().

Masked functions

It is not uncommon that two or more packages contain the same function name. For example, the package dplyr has a filter() function, but so does the package stats. The default filter() function depends on the order these packages are first loaded in the R session - the later one will be the default for the command filter().

You can check the order in the Environment pane of RStudio - click the drop-down for “Global Environment” and see the order of the packages. Functions from packages higher in the list will mask functions of the same name from packages lower down. When you first load a package, R will warn you in the console if masking is occurring, but this is easy to miss.

Here are ways you can fix masking:

  1. Specify the package name in the command. For example, use dplyr::filter()
  2. Re-arrange the order in which the packages are loaded (e.g. within library() or p_load()), and re-start R
  3. detach() the desired package and re-attach it, thus making it the highest/default version.
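As a sketch of the first option, using the filter() example above (both functions exist; the data used here are just built-in illustrations):

```r
library(dplyr)

# stats::filter() applies a linear filter to a time series;
# dplyr::filter() subsets rows of a data frame.
# The :: prefix makes explicit which one is meant, regardless of load order.
dplyr::filter(mtcars, mpg > 30)   # rows of the built-in mtcars data with mpg above 30
stats::filter(1:10, rep(1/3, 3))  # 3-point moving average of the numbers 1 to 10
```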

Installing older versions of packages

See this guide

Installation

How to install R

Visit this website https://www.r-project.org/ and download the latest version of R suitable for your computer.

How to install R Studio

Visit this website https://rstudio.com/products/rstudio/download/ and download the latest free Desktop version of RStudio suitable for your computer.

How to update R and RStudio

Other things you may need to install:

  • TinyTeX (for compiling an RMarkdown document to PDF)
  • Pandoc (for compiling RMarkdown documents)
  • RTools (for building packages for R)

TinyTeX

See https://yihui.org/tinytex/

To install from R:

install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()

Pandoc

Pandoc is a document converter - separate software from R, but bundled with RStudio. It handles the conversion of R Markdown documents to formats like .pdf and enables more complex functionality.

RTools

RTools is a collection of software for building packages for R

Install from this website: https://cran.r-project.org/bin/windows/Rtools/

RStudio

RStudio Orientation

First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.

For RStudio to function you must also have R installed on the computer (see this section for installation instructions).

RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!

By default RStudio displays four rectangle panes.

TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.

The R Console Pane

The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where commands are actually run and where non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but realize that these commands are not saved, as they are when run from a script.

If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.

The Source Pane
This pane, by default in the upper-left, is space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.

For Stata users, this pane is similar to your Do-file and Data Editor windows.

The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.

In Stata, this is most similar to the Variables Manager window.

Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).

This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.

RStudio settings

Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options. There you can change the default settings, including appearance/background color.

Scripts

Scripts are a fundamental part of programming. Storing your code in a script (vs. typing in the console) has many advantages:

  • Reproducibility
  • Version control
  • Commenting

Rmarkdown

Rmarkdown is a type of script in which the script itself becomes a document (PDF, Word, HTML, Powerpoint, etc.). See the handbook page on Rmarkdown documents.

R notebooks

There is no difference between writing in an R Markdown script and an R notebook; however, the execution of the document differs slightly. See this site for more details.

R Shiny

Shiny apps are contained within one script, which must be named app.R. This file has three components:

  1. A user interface (ui)
  2. A server function
  3. A call to the shinyApp function

See the handbook page on Shiny basics, or this online tutorial: Shiny tutorial

In older versions, the above file was split into two files (ui.R and server.R)
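A minimal sketch of such an app.R file, showing the three components (the input and output names here are invented for illustration):

```r
library(shiny)

# 1. user interface (ui)
ui <- fluidPage(
  numericInput("cases", "Number of cases:", value = 10),
  textOutput("message")
)

# 2. server function
server <- function(input, output) {
  output$message <- renderText(paste("You entered", input$cases, "cases"))
}

# 3. call to the shinyApp function
shinyApp(ui = ui, server = server)
```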

Working directory

These tabs cover how to use R working directories, and how this changes when you are working within an R project. The working directory is the root file location used by R for your work.
By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.

NOTE: If using an R project, the working directory will default to the R project root folder IF you open RStudio by clicking open the R project (the file with the .Rproj extension)

Set by command

Use the command setwd() with the filepath in quotations, for example: setwd("C:/Documents/R Files")

CAUTION: If using an RMarkdown script be aware of the following:

In an R Markdown script, the default working directory is the folder the Rmarkdown file (.Rmd) is saved to. If you want to change this, you can use setwd() as above, but know the change will only apply to that specific code chunk.

To change the working directory for all code chunks in an R markdown, edit the setup chunk to add the root.dir = parameter, such as below:

knitr::opts_knit$set(root.dir = 'desired/filepath/here')

Set Manually

Setting your working directory manually (point-and-click)

From RStudio click: Session / Set Working Directory / Choose Directory (you will have to do this each time you open RStudio)

In an R project

If you are working in an R project, your working directory will by default be the project's root folder. You can make the most of this with the here package (LINK).

Objects

Everything in R is an object. These sections will explain:

  • How to create objects (<-)
  • Types of objects (e.g. data frames, vectors..)
  • How to access subparts of objects (e.g. variables in a dataset)
  • Classes of objects (e.g. numeric, character, factor)

Everything is an object

Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands.

An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.

Defining objects (<-)

Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:

object_name <- value (or process/calculation that produce a value)

EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object reporting_week is created when it is assigned the character value "2018-W10" (the quote marks make these a character value).
The object reporting_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.

See the R commands and their output in the boxes below.

reporting_week <- "2018-W10"   # this command creates the object reporting_week by assigning it a value
reporting_week                 # this command prints the current value of reporting_week object in the console
## [1] "2018-W10"

NOTE: Note the [1] in the R console output is simply indicating that you are viewing the first item of the output

CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.

The following command will re-define the value of reporting_week:

reporting_week <- "2018-W51"   # assigns a NEW value to the object reporting_week
reporting_week                 # prints the current value of reporting_week in the console
## [1] "2018-W51"

Datasets are also objects and must be assigned names when they are imported.

In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package.

# linelist is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")

You can read more about importing and exporting datasets with the section on importing data.

CAUTION: A quick note on naming of objects:

  • Object names cannot contain spaces; use an underscore (_) or a period (.) instead.
  • Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
  • Object names must begin with a letter (they cannot begin with a number like 1, 2, or 3).
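These rules can be seen in action below (the object names are invented for illustration):

```r
total_cases <- 135       # valid name: starts with a letter, uses an underscore
Total_cases <- 200       # a DIFFERENT object - names are case-sensitive

total_cases == Total_cases
## [1] FALSE

# 1st_case <- "2018-01-05"   # ERROR - names cannot begin with a number
```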

Object structure

Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.

The graphic below, sourced from this online R tutorial, shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.

In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:

Common structure | Explanation | Example
Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the variable age_years).
Data Frames | Vectors (e.g. columns) that are bound together and all have the same number of rows. | linelist is a data frame.

Note that to create a vector that “stands alone”, or is not part of a data frame (such as a list of location names), the function c() is often used:

list_of_names <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")

Object classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class | Explanation | Examples
Character | Text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks”
Numeric | Numbers, which can include decimals. If within quotation marks they will be considered character. | 23.1 or 14
Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000
Factor | Vectors that have a specified order or hierarchy of values | Variable msf_involvement with ordered values N, S, SUB, and U
Date | Once R is told that certain data are Dates, they can be manipulated and displayed in special ways. See the page on Dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980
Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE
data.frame | How R stores a typical dataset: vectors (columns) of data bound together, all with the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each.

You can test the class of an object by feeding it to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.

class(linelist$age)     # class should be numeric
## [1] "numeric"
class(linelist$gender)  # class should be character
## [1] "character"

Often, you will need to convert objects or variables to another class.

Function | Action
as.character() | Converts to character class
as.numeric() | Converts to numeric class
as.integer() | Converts to integer class
as.Date() | Converts to Date class - Note: see section on dates for details
as.factor() | Converts to factor - Note: re-defining the order of value levels requires extra arguments
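For instance (the values are chosen only to illustrate each conversion):

```r
as.numeric("23.1")    # character "23.1" becomes the number 23.1
## [1] 23.1

as.integer(14.9)      # decimals are truncated, not rounded
## [1] 14

as.character(2000)    # the number 2000 becomes the text "2000"
## [1] "2000"

class(as.character(2000))
## [1] "character"
```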

Here is more online material on classes and data structures in R.

Columns/Variables ($)

Vectors within a data frame (variables in a dataset) can be called, referenced, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. The $ symbol must be used, otherwise R will not know where to look for or create the column.

In this handbook, we use the word “column” instead of “variable”.

# Retrieve the length of the vector age
length(linelist$age) # (age is a column in the linelist data frame)

By typing the name of the data frame followed by $ you will also see a list of all variables in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!


ADVANCED TIP: Some more complex objects (e.g. an epicontacts object) may have multiple levels, which can be accessed through multiple dollar signs (for example, epicontacts$linelist$date_onset).

Access with brackets ([])

You may need to view parts of objects, which is often done using the square brackets [ ].

To view specific rows and columns of a dataset, you can do this using the syntax dataframe[rows, columns]:

# View a specific row (2) from dataset, with all columns
linelist[2,]

# View all rows, but just one column
linelist[, "date_onset"]

# View values from row 2 and columns 5 through 10
linelist[2, 5:10] 

# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)] 

# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]

# View rows and columns based on criteria
# *** Note the dataframe must still be named in the criteria!
linelist[linelist$age > 25 , c("date_onset", "date_birth", "age")]

# Use View() to see the outputs in the RStudio Viewer pane (easier to read) 
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])

# Save as a new object
new_table <- linelist[2:20, c("date_onset")] 

The square brackets also work to call specific parts of an object, such as output of a summary() function, or a vector:

# All of the summary
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   15.09   22.00   67.00      88
# Just one part
summary(linelist$age)[2]  
## 1st Qu. 
##       6
my_vector <- c("a", "b", "c", "d", "e", "f")  # define the vector
my_vector[5]                                  # print the 5th element
## [1] "e"

Functions and packages

This section on functions explains:

  • What a function is and how they work
  • What arguments are
  • What packages are
  • How to get help understanding a function

Simple functions

A function is like a machine that receives inputs, does some action with those inputs, and produces an output.
What the output is depends on the function.

Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:

sqrt(49)
## [1] 7

Functions can also be applied to variables in a dataset. For example, when the function summary() is applied to the numeric variable age in the dataset linelist (see the section on the $ symbol), the output is a summary of the variable’s numeric and missing values.

summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   15.09   22.00   67.00      88

NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.

Functions with multiple arguments

Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.

  • Some arguments are required for the function to work correctly, others are optional.
  • Optional arguments have default settings if they are not specified.
  • Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs.

For example, this age_pyramid() command produces an age pyramid graphic based on defined age groups and a binary split variable, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_group as the variable to count, and gender as the binary variable to use for splitting the pyramid by color.

NOTE: For this example, in the background we have created a new variable called “age_group”. To learn how to create new variables, see that section of this handbook.

# Creates an age pyramid by specifying the dataframe, age group variable, and a variable to split the pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_group", split_by = "gender")

The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.

# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_group", "gender")

A more complex age_pyramid() command might include the optional arguments to:

  • Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
  • Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)

NOTE: For arguments specified with an equals symbol (e.g. coltotals = ...), their order among the arguments is not important (must still be within the parentheses and separated by commas).

age_pyramid(linelist, "age_group", "gender", proportional = TRUE, pal = c("orange", "purple"))

Packages

Packages contain functions.

On installation, R contains “base” functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use.

One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.

Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command at the beginning of each R session.

# this loads the package "tidyverse" for use in the current R session
library(tidyverse)

NOTE: While you only have to install a package once, you must load it at the beginning of every R session using library() command, or an alternative like pacman’s p_load() function.

Think of R as your personal library: When you download a package your library gains a book of functions, but each time you want to use a function in that book, you must borrow that book from your library.

For clarity in this handbook, functions are usually preceded by the name of their package using the :: symbol in the following way:

package_name::function_name()

Once a package is loaded for a session, this explicit style is not necessary and one can just use function_name(). However, giving the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()).
The :: notation also works even if the package has not been loaded with library(), because R loads the package's namespace on demand.

# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")

Dependencies
Packages often depend on other packages, and these are called “dependencies”. When a package is installed from CRAN, it will typically also install its dependencies.

Function help

To read more about a function, you can try searching online for resources OR search in the Help tab of the lower-right RStudio pane.

Piping (%>%)

Two general approaches to working with objects are:

  1. Tidyverse/piping - sends an object from function to function - emphasizes the action, not the object
  2. Define intermediate objects - emphasizes the object, as it is re-defined again and again

Pipes

Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.

  • Piping emphasizes a sequence of actions, not the object the actions are being performed on

  • Best when a sequence of actions must be performed on one object

  • From magrittr. Included in dplyr and tidyverse

  • Makes code more clean and easier to read, intuitive

  • Express a sequence of operations

  • The object is altered and then passed on to the next function

Read more on this approach in the tidyverse style guide

Example:

# A fake example of how to bake a cake using piping syntax

cake <- flour %>%       # to define cake, start with flour, and then...
  left_join(eggs) %>%   # add eggs
  left_join(oil) %>%    # add oil
  left_join(water) %>%  # add water
  mix_together(utensil = spoon, minutes = 2) %>%                # mix together
  bake(degrees = 350, system = "fahrenheit", minutes = 35) %>%  # bake
  let_cool()            # let it cool down

Read more about pipes at https://cfss.uchicago.edu/notes/pipes/

Piping is not a base R feature. To use the pipe, the magrittr package (or a package that re-exports it, such as dplyr) must be installed and loaded. Near the top of every template script is a code chunk that installs and loads the necessary packages, including dplyr. You can read more about piping in the documentation.

CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.

%<>%
This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, so object %<>% function() %>% function() is the same as object <- object %>% function() %>% function().
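A small illustration (the vector contents are invented):

```r
library(magrittr)

my_numbers <- c(4, 9, 16)
my_numbers %<>% sqrt()   # same as: my_numbers <- my_numbers %>% sqrt()
my_numbers
## [1] 2 3 4
```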

Define intermediate objects

This approach to changing objects/dataframes may be better if:

  • You need to manipulate multiple objects
  • There are intermediate steps that are meaningful and deserve separate object names

Risks:

  • Creating new objects for each step means creating lots of objects. If you use the wrong one, you might not realize it!
  • Naming all the objects can be confusing
  • Errors may not be easily detectable

Either name each intermediate object, or overwrite the original, or combine all the functions together. All come with their own risks.

Below are some examples:

# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)

batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)

cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)

cake <- let_cool(cake)

Combine all functions together - also difficult to read

# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))

Key operators and functions

Key operators and functions

This section details operators in R, such as:

  • Definitional operators
  • Relational operators (less than, equal to…)
  • Logical operators (and, or…)
  • Handling missing values
  • Mathematical operators and functions (+/-, >, sum(), median(), …)
  • The %in% operator

Assignment operators

Assignment operators

<-

The basic assignment operator in R is <-, used as object_name <- value (see the R Basics tab on “Defining an Object”).
This assignment operator can also be written as =. We advise use of <- for general R use.
We also advise surrounding operators with spaces, for readability.

<<-

If writing functions (LINK TO PAGE), or using R in an interactive way with sourced scripts (LINK TO PAGE), then you may need to use this assignment operator <<- (base R). This operator is used to define an object in a higher ‘parent’ R Environment (LINK to tab on R environments). Also see this online reference.
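A minimal sketch (the counter object and the function are hypothetical examples):

```r
counter <- 0   # defined in the global environment

increment <- function() {
  counter <<- counter + 1   # <<- reaches up to the parent environment
}

increment()
increment()
counter   # 2
```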

%<>%

This is an “assignment pipe” from the magritter package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, so object %<>% function() %>% function() is the same as object <- object %>% function() %>% function().
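A minimal sketch, assuming magrittr and dplyr are loaded (the data frame and filter step are hypothetical):

```r
library(magrittr)
library(dplyr)

df <- data.frame(age = c(5, 20, 35))

# pipe df forward AND re-define it in one step
df %<>% filter(age >= 18)   # same as: df <- df %>% filter(age >= 18)

nrow(df)   # 2
```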

%<+%

Used to add data to phylogenetic trees with the ggtree package. See the (LINK TO PAGE) or this online resource book.

Relational and logical operators

Relational and logical operators

Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:

Function                  Operator   Example      Example Result
Equal to                  ==         "A" == "a"   FALSE (because R is case sensitive)
Not equal to              !=         2 != 0       TRUE
Greater than              >          4 > 2        TRUE
Less than                 <          4 < 2        FALSE
Greater than or equal to  >=         6 >= 4       TRUE
Less than or equal to     <=         6 <= 4       FALSE
Value is missing          is.na()    is.na(7)     FALSE (see section on missing values)
Value is not missing      !is.na()   !is.na(7)    TRUE

Note that == (double equals) is different from = (single equals), which acts like the assignment operator <-.

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

Function Operator
AND &
OR | (vertical bar)
Parentheses ( ) Used to group criteria together and clarify order
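For example, applying relational and logical operators to two hypothetical vectors:

```r
age    <- c(10, 25, 36, 70)
status <- c("Positive", "Negative", "Positive", "Positive")

# AND: TRUE only where the person is an adult AND tested positive
age >= 18 & status == "Positive"
## [1] FALSE FALSE  TRUE  TRUE

# parentheses group the criteria: adult, OR (child AND positive)
(age >= 18) | (age < 18 & status == "Positive")
## [1] TRUE TRUE TRUE TRUE
```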

For example, below, we have a linelist with two variables we want to use to create our case definition, hep_e_rdt, a test result and other_cases_in_hh, which will tell us if there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:

linelist_cleaned <- linelist_cleaned %>%
  mutate(case_def = case_when(
    is.na(hep_e_rdt) & is.na(other_cases_in_hh)           ~ NA_character_,
    hep_e_rdt == "Positive"                               ~ "Confirmed",
    hep_e_rdt != "Positive" & other_cases_in_hh == "Yes"  ~ "Probable",
    TRUE                                                  ~ "Suspected"
  ))
Criteria in example above                                                    Resulting value in new variable “case_def”
The values of both hep_e_rdt and other_cases_in_hh are missing               NA (missing)
The value in hep_e_rdt is “Positive”                                         “Confirmed”
The value in hep_e_rdt is NOT “Positive” AND other_cases_in_hh is “Yes”      “Probable”
None of the above criteria are met                                           “Suspected”

Note that R is case-sensitive, so “Positive” is different from “positive”.

Missing values

Missing values

In R, missing values are represented by the special value NA (a “reserved” value) (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA.
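One way to perform this re-coding is with na_if() from dplyr; this is a sketch using a hypothetical data frame, not the only approach:

```r
library(dplyr)

# hypothetical data in which 99 and "Missing" represent missing values
df <- data.frame(age     = c(12, 99, 30),
                 outcome = c("Recovered", "Missing", "Died"))

df <- df %>%
  mutate(age     = na_if(age, 99),             # 99 becomes NA
         outcome = na_if(outcome, "Missing"))  # "Missing" becomes NA

df$age   ## [1] 12 NA 30
```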

To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.

rdt_result <- c("Positive", "Suspected", "Positive", NA)   # two positive cases, one suspected, and one unknown
is.na(rdt_result)  # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE  TRUE

Here is the R documentation on missing values

Variations on NA

NA is actually a logical value of length 1. You may also encounter NA_character_, NA_real_, NA_complex_, and NA_integer_, which correspond to specific classes.

The most prominent application of one of these variants in common epidemiology work is using case_when(). The Right-Hand Side (RHS) values must all be of the same class. Thus, if you have character outcomes on the RHS like “Confirmed”, “Suspect”, “Probable” and NA - you will get an error. Instead of NA you must have NA_character_. Likewise for integers, use NA_integer_.
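A brief sketch (the vector and outcome labels are hypothetical):

```r
library(dplyr)

rdt <- c("Positive", "Negative", NA)

# all right-hand sides are class character, including NA_character_
case_when(
  rdt == "Positive" ~ "Confirmed",
  rdt == "Negative" ~ "Not a case",
  TRUE              ~ NA_character_   # plain NA here would be class logical
)
## [1] "Confirmed"  "Not a case" NA
```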

NULL

NULL is the null object in R, often used to represent a list of 0 length. Use is.null() to evaluate this status.

More detail on the difference between NA and NULL is here
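A few quick illustrations:

```r
is.null(NULL)   # TRUE
is.null(NA)     # FALSE - NA is a value, not the null object

my_list <- list(a = 1, b = 2)
my_list$c          # NULL - a non-existent list element returns NULL
my_list$a <- NULL  # assigning NULL removes the element
names(my_list)     # "b"
```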

Mathematics and statistics

Mathematics and statistics

All the operators and functions on this page are automatically available in base R, except where a package is noted (e.g. janitor).

Mathematical operators

These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.

Objective Example in R
addition 2 + 3
subtraction 2 - 3
multiplication 2 * 3
division 30 / 5
exponent 2^3
order of operations ( )

Mathematical functions

Objective                        Function
rounding (banker’s rounding)     round(x, digits = n)
rounding (halves always up)      janitor::round_half_up(x, digits = n)
ceiling (round up)               ceiling(x)
floor (round down)               floor(x)
absolute value                   abs(x)
square root                      sqrt(x)
exponential                      exp(x)
natural logarithm                log(x)

DANGER: round() uses “banker’s rounding”, which rounds a .5 to the nearest even number (so 2.5 rounds to 2, but 3.5 rounds to 4). Use round_half_up() from janitor to consistently round halves up to the nearest whole number. See this

# use the appropriate rounding function for your work
round(c(2.5, 3.5))
## [1] 2 4
janitor::round_half_up(c(2.5, 3.5))
## [1] 3 4

Statistical functions:

CAUTION: By default, the functions below include missing values in calculations. A missing value will result in an output of NA, unless the argument na.rm = TRUE is specified

Objective Function
mean (average) mean(x, na.rm=T)
median median(x, na.rm=T)
standard deviation sd(x, na.rm=T)
quantiles* quantile(x, probs)
sum sum(x, na.rm=T)
minimum value min(x, na.rm=T)
maximum value max(x, na.rm=T)
range of numeric values range(x, na.rm=T)
summary** summary(x)

DANGER: If providing a vector of numbers to one of the above functions, be sure to wrap the numbers within c().

# If supplying raw numbers to a function, wrap them in c()
mean(1, 6, 12, 10, 5, 0)    # !!! INCORRECT !!!  
## [1] 1
mean(c(1, 6, 12, 10, 5, 0)) # CORRECT
## [1] 5.666667
* quantile(): x is the numeric vector to examine, and probs is a numeric vector of probabilities between 0 and 1.0, e.g. c(0.5, 0.8, 0.85)
** summary(): gives a summary of a numeric vector, including the mean, median, and common percentiles

Other useful functions:

Objective Function Example
create a sequence seq(from, to, by) seq(1, 10, 2)
repeat x, n times rep(x, ntimes) rep(1:3, 2) or rep(c("a", "b", "c"), 3)
subdivide a numeric vector cut(x, n) cut(linelist$age, 5)
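For example:

```r
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9

rep(c("a", "b"), times = 3)
## [1] "a" "b" "a" "b" "a" "b"

# cut() returns a factor of intervals; the breaks are supplied here explicitly
cut(c(2, 15, 38, 67), breaks = c(0, 18, 50, 100))
## [1] (0,18]   (0,18]   (18,50]  (50,100]
## Levels: (0,18] (18,50] (50,100]
```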

%in%

%in%

A very useful operator for matching values, and quickly assessing if a value is within a vector or dataframe.

my_vector <- c("a", "b", "c", "d")

"a" %in% my_vector
## [1] TRUE
"h" %in% my_vector
## [1] FALSE

To ask if a value is not %in%, put an exclamation mark (!) in front of the logic statement:

# to negate, put an exclamation in front
!"a" %in% my_vector
## [1] FALSE
!"h" %in% my_vector
## [1] TRUE

%in% is very useful when using the dplyr function case_when() to recode values in a column. For example:

linelist <- linelist %>% 
  mutate(hospital = case_when(
    hospital %in% c("St. Fr.", "Saint Francis") ~ "St. Francis")) # convert to correct spelling

Loading Packages

Loading Packages

This section describes several ways to install a package:

  • Via the online package repository (CRAN)
  • From a ZIP file
  • From Github

CRAN

CRAN

From the CRAN online repository of packages
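For example, the standard way to install from CRAN is install.packages(), followed by library() each session; pacman::p_load() combines the two steps (dplyr is used here only as an example):

```r
# install once from CRAN (requires internet), then load for each new R session
install.packages("dplyr")
library(dplyr)

# alternatively, p_load() installs the package if missing, and loads it
pacman::p_load(dplyr)
```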

ZIP files

ZIP files

Download a ZIP file of a package, unpack it, and save it.
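The file can then be installed by giving install.packages() the local filepath and setting repos = NULL; this sketch assumes a Windows binary ZIP, and the filepath shown is hypothetical:

```r
# install from a local ZIP file rather than from CRAN
# repos = NULL tells R the first argument is a filepath, not a package name
install.packages("C:/Users/me/Downloads/dplyr_1.0.0.zip",
                 repos = NULL,
                 type  = "win.binary")   # assumption: a Windows binary package
```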

Github

Github

Download a package under development from Github repository

remotes package
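A common approach uses install_github() from the remotes package; the “user/repo” below is a placeholder for the real Github account and repository name:

```r
# install remotes from CRAN, if not already installed
install.packages("remotes")

# install a development package directly from its Github repository
# ("user/repo" is a placeholder - replace with the actual account/repository)
remotes::install_github("user/repo")
```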

Errors & Warnings

Errors & Warnings

This section explains:

  • General syntax for writing R code
  • Code assists
  • The difference between errors and warnings

Common errors and warnings and their solutions can be found in X section (TODO). See the handbook page on common errors and warnings.

Error versus Warning

Error versus warning

When a command is run, the R Console may show you warning or error messages in red text.

  • A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.

  • An error means that R was not able to complete your command.

Look for clues:

  • The error/warning message will often include a line number for the problem.

  • If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.

If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!

General syntax tips

General syntax tips

A few things to remember when writing commands in R, to avoid errors and warnings:

  • Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk
  • Avoid spaces in column and object names. Use underscore ( _ ) or periods ( . ) instead
  • Remember to separate a function’s arguments with commas
  • R is case-sensitive, meaning Variable_A is different from variable_A

Code assists

Code assists

Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parentheses, RStudio will raise a flag on that line, on the right side of the script, to warn you.

[Image: Warnings_and_Errors.png]

Recommended training

Importing data

Overview

Introduction to importing data

Packages

Packages

The key package we recommend for importing data is: rio. rio offers the useful function import() which can import many types of files into R.

The alternative to rio is to use functions from several other packages, each specific to a type of file (e.g. read.csv(), read.xlsx(), etc.). These alternatives can be difficult to remember, whereas using rio::import() works for almost every file type.

Optionally, the package here can be used in conjunction with rio. It locates files on your computer via relative pathways, usually within the context of an R project. Relative pathways are relative from a designated folder location, so that pathways listed in R code will not break when the script is run on a different computer.

This code chunk shows the loading of packages for importing data.

# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(rio, here)

import()

import()

When you import a dataset, you are doing the following:

  1. Creating a new, named data frame object in your R environment
  2. Defining the new object as the imported dataset

The function import() from the package rio makes it easy to import many types of data files.

# An example:
#############
library(rio)                                                     # ensure package rio is loaded for use

# New object is defined as the imported data
my_csv_data <- import("linelist.csv")                            # importing a csv file

my_Excel_data <- import("observations.xlsx", which = "February") # import an Excel file

import() uses the file’s extension (e.g. .xlsx, .csv, .dta, etc.) to appropriately import the file. Any optional arguments specific to the filetype can be supplied as well.

You can read more about the rio package in this online vignette

https://cran.r-project.org/web/packages/rio/readme/README.html

CAUTION: In the example above, the datasets are assumed to be located in the working directory, or the same folder as the script.

TO DO

import a specific range of cells skip rows, in excel and csv rio table of functions used for import/export/convert https://cran.r-project.org/web/packages/rio/vignettes/rio.html other useful function to know as backup EpiInfo SAS STATA Google Spreadsheets R files

Import from filepath

Import from filepath

A filepath can be provided in full (as below) or as a relative filepath (see next tab). Providing a full filepath can be fast, and may be best when referencing files on a shared/network drive.

The function import() (from the package rio) accepts a filepath in quotes. A few things to note:

  • Slashes must be forward slashes, as in the code shown. This is NOT the default for Windows filepaths.
  • Filepaths that begin with double slashes (e.g. “//…”) will likely not be recognized by R and will produce an error. Consider moving these files to a “named” or “lettered” drive that begins with a letter (e.g. “J:” or “C:”). See the section on using Network Drive for more details on this issue.
# A demonstration providing the full filepath to import()
my_data <- rio::import("C:/Users/Neale/Documents/my_excel_file.xlsx")

Excel sheet

Excel sheets

If importing a specific sheet from an Excel file, include the sheet name in the which = argument of import(). For example:

# A demonstration showing how to import a specific Excel sheet
my_data <- rio::import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parenthesis of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelists", "linelist.xlsx"), which = "Sheet1")

Select file manually

Select file manually

You can import data manually via one of these methods:

  • Environment RStudio Pane, click “Import Dataset”, and select the type of data
  • Click File / Import Dataset / (select the type of data)
  • To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# A demonstration showing manual selection of a file. When this command is run, a POP-UP window should appear. 
# The filepath of the selected file will be supplied to the import() command.

my_data <- rio::import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

Relative filepaths (here())

Relative filepaths (here())

Relative filepaths differ from static filepaths in that they are relative to an R project's root directory. For example:

  • A static filepath: import("C:/Users/nsbatra/My Documents/R files/epiproject/data/linelists/ebola_linelist.xlsx")
    • Specific fixed path
    • Useful if multiple users are running a script hosted on a network drive
  • A relative filepath: import(here("data", "linelists", "ebola_linelist.xlsx"))
    • Path is given in relation to a root directory (typically the root folder of an R project)
    • Best if working within an R project, or planning to zip and share entire project with others

The package here and its function here() facilitate relative pathways.

here() works best within R projects. When the here package is first loaded (library(here)), it automatically considers the top-level folder of your R project as “here” - a benchmark for all other files in the project.

Thus, in your script, if you want to import or reference a file saved in your R project’s folders, you use the function here() to tell R where the file is in relation to that benchmark.

If you are unsure where “here” is set to, run the function here() with empty parentheses:

# This command tells you the folder path that "here" is set to 
here::here()

Below is an example of importing the file “fluH7N9_China_2013.csv”, which is located in the benchmark “here” folder. All you have to do is provide the name of the file in quotes (with the appropriate file extension).

linelist <- import(here("fluH7N9_China_2013.csv"))

If the file is within a subfolder - let’s say a “data” folder - write these folder names in quotes, separated by commas, as below:

linelist <- import(here("data", "fluH7N9_China_2013.csv"))

Using the here() command produces a character filepath, which can then be processed by the import() function.

# the filepath
here("data", "fluH7N9_China_2013.csv")

# the filepath is given to the import() function
linelist <- import(here("data", "fluH7N9_China_2013.csv"))

NOTE: You can still import a specific sheet of an excel file as noted in the Excel tab. The here() command only supplies the filepath.

Google sheets

Google sheets

You can import data from an online Google spreadsheet with the googlesheets4 package, after authenticating your access to the spreadsheet.

pacman::p_load("googlesheets4")

Below, a demo Google sheet is imported and saved. This command may prompt confirmation of authentication of your Google account. Follow prompts and pop-ups in your internet browser to grant Tidyverse API packages permissions to edit, create, and delete your spreadsheets in Google Drive.

The sheet below is “viewable for anyone with the link” and you can try to import it.

Gsheets_demo <- read_sheet("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")

The sheet can also be imported using only the sheet ID, a shorter part of the URL:

Gsheets_demo <- read_sheet("1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY")

Another package, googledrive, offers useful functions for managing files in Google Drive, and can be used alongside functions such as gs4_create() and sheet_write() from googlesheets4 to write, edit, and delete Google sheets.

Here are some other helpful online tutorials: basic importing tutorial more detail interaction between the two packages

Websites

Websites

Scraping data from a website - TBD

Skip rows

Skip rows

Sometimes, you may want to avoid importing a row of data (e.g. the column names, which are row 1).
You can do this with the argument skip = if using import() from the rio package on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row

Unfortunately skip = only accepts one integer value, not a range (e.g. “2:10”). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr. See the example below of skipping only row 2.
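A sketch of that approach, assuming the “linelist_raw.xlsx” file used elsewhere in this handbook, and that import() passes n_max = and col_names = through to the underlying Excel reader:

```r
library(rio)
library(dplyr)

# import the header plus only the first data row (everything above data row 2)
top    <- import("linelist_raw.xlsx", n_max = 1)

# import again, skipping the header and data rows 1-2, re-applying column names
bottom <- import("linelist_raw.xlsx", skip = 3, col_names = names(top))

# stack the two pieces: all rows except data row 2
combined <- bind_rows(top, bottom)
```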

Removing a second header row

Your data may have a second header row, for example if it is a “data dictionary” row (see example below).

This situation can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# For excel files (remove 2nd row)
linelist_raw_names <- import("linelist_raw.xlsx") %>% names()  # save true column names

# import, skip row 2, assign to col_names =
linelist_raw <- import("linelist_raw.xlsx", skip = 2, col_names = linelist_raw_names) 

For CSV files:

# For csv files
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names

# note argument is 'col.names ='
linelist_raw <- import("linelist_raw.csv", skip = 2, col.names = linelist_raw_names) 

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
library(janitor)   # for clean_names()
stickers_dict <- import("linelist_raw.xlsx") %>% 
  clean_names() %>% 
  gather(variable_name, variable_description)
stickers_dict

Manual data entry

Manual data entry

Entry by columns

Entry by columns

Since a data frame is a combination of vertical vectors (columns), R by default expects manual entry of data to also be in vertical vectors (columns).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

CAUTION: All vectors must be the same length (same number of values).

The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_cols)

Entry by rows

Entry by rows

Use the tribble() function from the tibble package, part of the tidyverse (online tibble reference).

Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.).
You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
                        ~colA, ~colB,
                        "a",   1,
                        "b",   2,
                        "c",   3
                      )

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_rows)

OR ADD ROWS dplyr TO DO

Pasting from clipboard

Pasting from clipboard

If you copy data from elsewhere and have it on your clipboard, you can try the following command to convert those data into an R data frame:

manual_entry_clipboard <- read.table(file = "clipboard",
                                     sep = "\t",          # separator could be tab ("\t"), commas, etc.
                                     header = TRUE)       # if there is a header row

R projects

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Errors & Warnings

Overview

Troubleshooting common errors and warnings

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

# Tried to add a value ("Missing") to a factor (with replace_na() operating on a factor)
Problem with `mutate()` input `age_cat`.
i invalid factor level, NA generated
i Input `age_cat` is `replace_na(age_cat, "Missing")`.

# Ran recode() without re-stating the x variable, as in mutate(x = recode(x, OLD = NEW))
Error: Problem with `mutate()` input `hospital`.
x argument ".x" is missing, with no default
i Input `hospital` is `recode(...)`.

Error: Insufficient values in manual scale. 3 needed but only 2 provided.
This can occur when a ggplot() scale such as scale_fill_manual(values = c("orange", "purple")) supplies fewer values than there are factor levels. Consider whether NA has become a factor level.

Error: unexpected symbol in:
"  geom_histogram(stat = "identity")+
  tidyquant::geom_ma(n=7, size = 2, color = "red" lty"

If you see “unexpected symbol” check for missing commas

Wrong slashes If you see an error like this when you try to export or import:

No such file or directory:

Make sure you have used / within the filepath, not \.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

III Data Management

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

  • Dealing with character case (upper, lower, title, etc.)
  • Factor columns


Preparation

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               rio,        # importing data
               epikit)     # age_categories() function  

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("linelist_raw.xlsx")
## New names:
## * `` -> ...28

You can view the first 50 rows of the original “raw” dataset below:

Cleaning pipeline

Cleaning pipeline

In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R, this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain utilizes dplyr verbs and the magrittr pipe operator (see the handbook page on dplyr and tidyverse coding style (LINK HERE)). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

In a cleaning pipeline the order of the steps is important. Cleaning steps might include:

  • Importing of data
  • Column names cleaned or changed
  • Rows filtered, added, or de-duplicated
  • Columns selected, added, transformed, or re-ordered
  • Values re-coded, cleaned, or grouped

Column names

Column names

Column names are used very often so they need to have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (use underscores (_) instead)
  • No unusual characters (&, #…)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The columns names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

Also note that in the raw data, the final two column names were two merged cells with one name. The import() function used the name for the first of the two columns, and assigned the second column the name “...28” as it was empty (referring to the 28th column).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date"  "date onset"      "hosp date"       "date_of_outcome"
##  [8] "outcome"         "gender"          "hospital"        "lon"             "lat"             "infector"        "source"         
## [15] "age"             "wt_kg"           "ht_cm"           "ct_blood"        "age_unit"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "...28"
Note: For a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic column name cleaning

Automatic column name cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. German ö becomes “o”, Spanish ñ becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can designate specific name replacements with the replace = argument (e.g. replace = c(onset = "date_of_onset"))
  • Here is an online vignette

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date"  "date_onset"      "hosp_date"       "date_of_outcome"
##  [8] "outcome"         "gender"          "hospital"        "lon"             "lat"             "infector"        "source"         
## [15] "age"             "wt_kg"           "ht_cm"           "ct_blood"        "age_unit"        "fever"           "chills"         
## [22] "cough"           "aches"           "vomit"           "temp"            "time_admission"  "merged_header"   "x28"

NOTE: The column name “…28” was changed to “x28”.

Manual column name cleaning

Manual column name cleaning

Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD: the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the column names have been changed:

Rename by column position

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn = 1,
       newNameForSecondColumn = 2)

Empty Excel column names

If you are importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name such as "…1" or "…2". You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).
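A hedged sketch of renaming such a column (the data frame and column names here are hypothetical): an unnamed column that arrived as "...1" can be renamed by position with rename().

```r
library(dplyr)

# simulate a dataframe whose first column arrived with no name
df <- data.frame(v1 = c(36.5, 38.1), sex = c("m", "f"))
names(df)[1] <- "...1"          # the kind of name import functions auto-generate

# rename by position (could also reference `...1` in backticks)
df <- df %>% rename(temp = 1)

names(df)
# "temp" "sex"
```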

Merged Excel column names

Merged cells in an Excel file are a common occurrence when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accommodate merged cells.

Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users in the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)

DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.

Skip import of rows

Sometimes, you may want to avoid importing a row of data (e.g. the column names, which are row 1). You can do this with the argument skip = when using import() from the rio package on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row

Unfortunately skip = only accepts one integer value, not a range (e.g. “2:10”). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr. See the example below of skipping only row 2.
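As a minimal sketch of the "split and re-bind" idea (the data frame here is a hypothetical stand-in for an already-imported dataset; with real files you would import twice with different skip = values):

```r
library(dplyr)

# stand-in for linelist_raw; row 2 is the one to skip
df <- data.frame(id = 1:4, status = c("keep", "skip", "keep", "keep"))

# keep row 1 and rows 3 onward, then re-bind the pieces
df_skipped <- bind_rows(
  slice(df, 1),
  slice(df, 3:n())
)

df_skipped$status
# "keep" "keep" "keep"
```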

Removing a second header row

Your data may contain a second, non-data row, for example a "data dictionary" row (see example below).

This situation can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (the header and second row)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# For excel files (remove 2nd row)
linelist_raw_names <- import("linelist_raw.xlsx") %>% names()  # save true column names

# import, skip row 2, assign to col_names =
linelist_raw <- import("linelist_raw.xlsx", skip = 2, col_names = linelist_raw_names) 

For CSV files:

# For csv files
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names

# note argument is 'col.names ='
linelist_raw <- import("linelist_raw.csv", skip = 2, col.names = linelist_raw_names) 

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
linelist_dict <- import("linelist_raw.xlsx") %>% 
  clean_names() %>% 
  gather(variable_name, variable_description)   # convert to long format: name-description pairs
linelist_dict

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row. Replace “df” with the name of your dataset.

names(df) <- paste(names(df), df[1, ], sep = "_")
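A small self-contained demonstration of this pattern (the data are hypothetical); after pasting, the first row no longer holds data and can be dropped:

```r
# hypothetical two-header-row import: real data start at row 2
df <- data.frame(case = c("status", "confirmed"),
                 age  = c("years", "31"))

# paste the existing names together with the values in the first row
names(df) <- paste(names(df), df[1, ], sep = "_")

# drop the first row, which is no longer needed
df <- df[-1, ]

names(df)
# "case_status" "age_years"
```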

Select or re-order columns

CAUTION: This tab may follow from previous tabs.

Often an early step of cleaning data is selecting the columns you want to work with, and setting their order in the dataframe. In a dplyr chain of verbs, this is done with select().

CAUTION: In the examples below, linelist is modified with select() but not over-written. The resulting column names are displayed for purposes of example only.

Here are all the column names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"           "date_infection"       "date_onset"          
##  [6] "date_hospitalisation" "date_outcome"         "outcome"              "gender"               "hospital"            
## [11] "lon"                  "lat"                  "infector"             "source"               "age"                 
## [16] "wt_kg"                "ht_cm"                "ct_blood"             "age_unit"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

Select & re-order

Select only the columns you want to keep, in the order you want them to appear:

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation" "fever"

Indicate which columns to remove by placing a minus symbol "-" in front of the column name (e.g. select(-outcome)), or in front of a vector of column names (as below). All other columns will be retained. Inside select() you can use operators such as c() to list several columns, : for consecutive columns, ! for negation, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
  names()
##  [1] "row_num"              "case_id"              "generation"           "date_infection"       "date_hospitalisation"
##  [6] "date_outcome"         "outcome"              "gender"               "hospital"             "lon"                 
## [11] "lat"                  "infector"             "source"               "age"                  "wt_kg"               
## [16] "ht_cm"                "ct_blood"             "age_unit"             "temp"                 "time_admission"      
## [21] "merged_header"        "x28"

Re-order the columns - use everything() to signify all other columns not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation" "gender"               "row_num"             
##  [6] "generation"           "date_infection"       "date_outcome"         "outcome"              "hospital"            
## [11] "lon"                  "lat"                  "infector"             "source"               "age"                 
## [16] "wt_kg"                "ht_cm"                "ct_blood"             "age_unit"             "fever"               
## [21] "chills"               "cough"                "aches"                "vomit"                "temp"                
## [26] "time_admission"       "merged_header"        "x28"

As well as everything(), there are several other helper functions that work within select():

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those for which it returns TRUE
  • starts_with() - matches a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches a specified suffix. Example: select(ends_with("_end"))
  • contains() - columns containing a character string. Example: select(contains("time"))
  • matches() - applies a regular expression (regex). Example: select(matches("[pt]al"))
  • num_range() - matches a numeric range following a prefix. Example: select(num_range("case", 1:3)) selects case1, case2, and case3
  • any_of() - matches named columns if they exist. Useful when a name might not exist. Example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))
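A quick self-contained illustration of two of these helpers on a toy data frame (the column names are hypothetical); note that any_of() expects a character vector and does not error when a name is absent:

```r
library(dplyr)

df <- data.frame(date_onset = 1, date_outcome = 2, fever = 3)

# starts_with(): columns whose names begin with a prefix
df %>% select(starts_with("date")) %>% names()
# "date_onset" "date_outcome"

# any_of(): silently skips names that do not exist
df %>% select(any_of(c("fever", "not_a_column"))) %>% names()
# "fever"
```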

Here are examples using contains() and matches():

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"
# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"             "fever"

select() as a stand-alone command

select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the dataframe to operate on.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add to the pipe chain

In the linelist, there are a few columns we do not need: row_num, merged_header, and x28. Remove them by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28))

Modify class

CAUTION: This tab may follow from previous tabs.

See section on object classes

Often you will need to set the correct class for a column. The most common approach is to use mutate() to re-define the column as itself, converted to a different class. Generally, this looks like:

# Examples of modifying class
linelist <- linelist %>% 
  mutate(date_var      = as.Date(date_var, format = "%m/%d/%Y"),   # format must match the raw data (here "%m/%d/%Y" for MM/DD/YYYY)
         numeric_var   = as.numeric(numeric_var),
         character_var = as.character(character_var),
         factor_var    = factor(factor_var, levels = c(), labels = c())
         )

Pre-checks and errors

First we run some checks on the classes of important columns.

The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

However, if we try to convert the date_onset column to class Date, we get an error. Use table(), sort(), or another method to examine all the values and identify any that differ. For example, in our dataset one date_onset value was entered in a different format ("15th April 2014") than all the other values:

## 
## 15th April 2014      2012-04-21      2012-05-09      2012-05-14      2012-05-27      2012-06-22 
##               1               1               1               1               2               1

Before we can convert "date_onset" to class Date, this value must be fixed to match the format of the others. You can fix the date in the source data, or do it in the cleaning pipeline via mutate() and recode(). This must be done before the command that converts the column to class Date (LINK TO DATE SECTION).

# fix incorrect values                 # old value       # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15"))

The mutate() line can be read as: "mutate date_onset to equal date_onset, recoded so that OLD VALUE is changed to NEW VALUE". Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.

Modify multiple columns

You can use the dplyr function across() within mutate() to convert several columns at once to a new class. across() lets you specify which columns a function should apply to. Below, we target the columns where is.POSIXct() returns TRUE (a date-time class that carries unnecessary timestamps) and apply the function as.Date() to them, converting them to class "date".

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R
  • Note that the functions in across() are written without the empty parentheses ()

linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Below, the described cleaning steps are added to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

  
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) 

Add columns and rows

See the tabs below to add columns and rows

Add columns

mutate()

We advise creating new columns with dplyr functions as part of a chain of such verbs (e.g. filter(), mutate(), etc.).
If you need a stand-alone command, you can use mutate() or base R syntax to create a new column (see below).

The verb mutate() is used to add a new column, or to modify an existing one. Below is an example of creating a new column with mutate(). The syntax is: new_column_name = value or function.

linelist <- linelist %>% 
  mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset))

It is best practice to separate each new column with a comma and new line. Below, some practice columns are created:

linelist <- linelist %>%                       # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,             # new column = duplicate/copy another column
         new_var_static = 7,                   # new column = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
         new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
         ) 

Scroll to the right to see the new columns (first 50 rows):

TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.

New columns using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add rows

TO DO

Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666, case_id = "abc", generation = 4, infection_date = as.Date("2020-10-10"), .before = 2)

Use the arguments .before = and .after = to control placement: .before = 3 puts the new row before the current 3rd row. The default is to add the row at the end. Columns not specified will be left empty (NA). The new row's number may look strange ("…23") because the row numbers have shifted, so if running the command twice, examine/test carefully.

If your class is off you will see an error like this: Error: Can't combine ..1$infection date and ..2$infection date. (For a date value, remember to wrap the date in the function as.Date(), e.g. as.Date("2020-10-10").)
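A minimal tibble demonstration of add_row() and .before (the values are hypothetical); note that unspecified columns become NA:

```r
library(dplyr)
library(tibble)

df <- tibble(case_id = c("a", "b"), age = c(10, 20))

# insert a new row before the current 2nd row; the age column is left NA
df2 <- df %>% add_row(case_id = "new", .before = 2)

df2$case_id   # "a" "new" "b"
df2$age       # 10 NA 20
```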

New columns using grouped values

CAUTION: This tab may follow from previous tabs.

Using mutate() on GROUPED dataframes: https://dplyr.tidyverse.org/reference/mutate.html

Taken from the website above:

"Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

The former normalises mass by the global average whereas the latter normalises by the averages within species levels."

Add to pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 

  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

  # create column: delay to hospitalisation
  mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset))

Recoding values

For example, in linelist the values in the column “hospital” must be cleaned. There are several different spellings (often the word “Hospital” is missing an “s” and is written “Hopital”), and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                      Central Hopital                     Central Hospital                           Hospital A 
##                                   11                                  454                                  289 
##                           Hospital B                     Military Hopital                    Military Hospital 
##                                  289                                   31                                  802 
##                     Mitylira Hopital                    Mitylira Hospital                                Other 
##                                    1                                   82                                  902 
##                         Port Hopital                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                   47                                 1760                                  426 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   11                                 1504

Manual recoding

These tabs demonstrate re-coding values manually by providing specific spellings to be corrected:

  • Using replace() for specific rows
  • Using recode() for entire columns
  • Using base R

replace()

To manually change values in specific rows of a dataframe (from within a pipe chain), use replace() within mutate().
Use a logical condition to specify the rows, for example the ID value of one row. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).

In the first example below, the gender value, in the row where id is “2195”, is changed to “Female”.

# Example: change gender of one specific observation to "Female" 
mutate(gender = replace(gender, id == "2195", "Female"))

# Example: change gender of one specific observation to NA 
mutate(gender = replace(gender, id == "2195", NA))
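Put together as a runnable sketch on a toy data frame (the ids and values here are hypothetical):

```r
library(dplyr)

df <- data.frame(id     = c("2194", "2195", "2196"),
                 gender = c("m", "m", "f"))

# change gender only in the row where id is "2195"
df <- df %>%
  mutate(gender = replace(gender, id == "2195", "Female"))

df$gender
# "m" "Female" "f"
```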

recode()

To change spellings manually, one-by-one, you can use the recode() function within the mutate() function. The code below says that the column "hospital" should be re-defined as the current column "hospital", but with certain changes (the syntax is OLD = NEW). Don't forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see the values in the hospital column have been corrected:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A                           Hospital B 
##                                  465                                  289                                  289 
##                    Military Hospital                                Other                        Port Hospital 
##                                  916                                  902                                 1807 
## St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  437                                 1504

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R's missing value NA). You can reference this value with two quotation marks and no space in between ("").
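For instance, a blank string can be converted to NA with na_if() (a small self-contained sketch with hypothetical values):

```r
library(dplyr)

df <- data.frame(hospital = c("Port Hospital", "", "Central Hospital"))

# convert the blank string "" to R's missing value NA
df <- df %>% mutate(hospital = na_if(hospital, ""))

df$hospital
# "Port Hospital" NA "Central Hospital"
```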

base R

If necessary, you can make manual changes to a specific value in a dataframe by referencing its row number or case ID. But remember it is better to make these changes permanently in the underlying data!

Here is a fake example. It reads as: "Change the value of the dataframe linelist's column date_onset (for the row where linelist's column case_id has the value '9d4019') to as.Date("2020-10-24")".

linelist$date_onset[linelist$case_id == "9d4019"] <- as.Date("2020-10-24")

Recoding by logic

These tabs demonstrate re-coding values in a column using logic and conditions:

  • Using case_when()
  • Using ifelse() and if_else()
  • Using special dplyr recoding functions like:
    • replace_na()
    • na_if()
    • coalesce()

case_when()

If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!

Note that all Right-hand side (RHS) inputs must be of the same class (e.g. character, numeric, logical). Notice the use of the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  dplyr::mutate(age_years = case_when(
            age_unit == "years"  ~ age,       # if age is given in years
            age_unit == "months" ~ age/12,    # if age is given in months
            is.na(age_unit)      ~ age,       # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # Any other circumstance

ifelse() and if_else()

For simple logical re-coding or new variable creation, you can use ifelse() or if_else(), though in most cases it is better to use case_when().

These commands are simplified versions of an if/else statement. The general syntax is: ifelse(condition, value if condition evaluates to TRUE, value if condition evaluates to FALSE). If used within mutate(), each row is evaluated. if_else() is a stricter dplyr version that requires the TRUE and FALSE values to be of the same class (which matters, for example, with dates).

It can be tempting to string together many ifelse() commands… resist this and use case_when() instead! It is much simpler, easier to read, and easier to debug.


You can reference other columns with the ifelse() function within mutate():

Example of ifelse():

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

Example of if_else() (using dates): Note that if the 'true' value is a date, the 'false' value must also qualify as a date, hence the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Note: If you want to alternate a value used in your code based on other circumstances, consider using switch() from base R. For example if… TO DO. See the section on using switch() in the page on R interactive console.

Recoding using special dplyr functions

Using replace_na()

To change missing values (NA) to a specific value, such as "Missing", use the function replace_na() (from the tidyr package) within mutate(). Note that this is used in the same manner as recode() above: the name of the column must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Using na_if()

Likewise you can quickly convert a specific character value to NA using na_if(). The command below is the opposite of the one above. It converts any values of “Missing” to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Using coalesce()

This dplyr function finds the first non-missing value at each position. You provide it with columns, and for each row it fills in the first non-missing value found across those columns.

For example, you might use coalesce() to create a "location" column from hypothetical columns "patient_residence" and "reporting_jurisdiction", prioritizing patient residence information if it exists.

linelist <- linelist %>% 
  mutate(location = coalesce(patient_residence, reporting_jurisdiction))
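The behavior is easiest to see on bare vectors (a minimal demonstration with hypothetical values):

```r
library(dplyr)

residence    <- c("District A", NA,           NA)
jurisdiction <- c("District B", "District C", NA)

# per position: take residence when present, otherwise fall back to jurisdiction
coalesce(residence, jurisdiction)
# "District A" "District C" NA
```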

TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

Recoding using cleaning dictionaries

CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(column %in% c(names(mll_raw), ".global"))

## define columns that are not cleaned
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean columns using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Add to pipe chain

Here we add the described cleaning steps to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    # create column: delay to hospitalisation
  mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

Filter rows


CAUTION: This tab builds on cleaning steps shown in previous tabs.

A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

The tabs below show how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R.

A simple filter()


This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.

In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6609.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6605.

A complex filter()


A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

How filters handle missing numeric and date values

Can we simply filter by date_onset to keep rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would accidentally remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number will also remove any rows with missing values (NA)! This is because a comparison with NA evaluates to NA, and filter() keeps only rows where the condition is TRUE.
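A quick illustration with a toy date vector:

```r
# toy vector: one date after the cutoff, one missing, one before
dates <- as.Date(c("2014-01-15", NA, "2012-05-02"))

dates > as.Date("2013-06-01")
# [1]  TRUE    NA FALSE

# filter() keeps only rows where the condition is TRUE,
# so the NA row would be silently dropped
```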

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  351   99   15
##   Hospital A                            231   41    0    0   16
##   Hospital B                            228   40    0    0   21
##   Military Hospital                       0    0  679  204   33
##   Missing                                 0    0 1119  322   60
##   Other                                   0    0  685  173   44
##   Port Hospital                           7    2 1368  344   86
##   St. Mark's Maternity Hospital (SMMH)    0    0  330   93   14
##   <NA>                                    0    0    0    0    0

What other criteria can we filter on to remove the first outbreak from the dataset? We see that:

  • The first epidemic occurred at Hospital A and Hospital B, and there were also 9 cases at Port Hospital.
  • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude 586 rows in total:

  • The 549 rows with onset in 2012 and 2013 (at Hospital A, Hospital B, or Port Hospital)
  • The 37 rows from Hospitals A & B with missing onset dates

We do not want to exclude the 252 other rows with missing onset dates.

We start with a linelist of 6605 rows. Here is our filter statement:

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6019

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 9 Port Hospital cases from 2012 & 2013 are removed, and all other values are unchanged - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      351   99   15
##   Military Hospital                     679  204   33
##   Missing                              1119  322   60
##   Other                                 685  173   44
##   Port Hospital                        1368  344   86
##   St. Mark's Maternity Hospital (SMMH)  330   93   14
##   <NA>                                    0    0    0

Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.
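For example, these two approaches produce the same result (the columns are from this linelist):

```r
# conditions separated by commas are combined with AND (&)
linelist %>% 
  filter(!is.na(case_id),
         date_onset > as.Date("2013-06-01"))

# equivalent: pipe to separate filter() commands for clarity
linelist %>% 
  filter(!is.na(case_id)) %>% 
  filter(date_onset > as.Date("2013-06-01"))
```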

Note: some readers may notice that it would be easier to just filter by date_hospitalisation, because that column is 100% complete. This is true, but date_onset is used here for the purposes of a complex filter example.

Filter as a stand-alone command


Filtering can also be done as a stand-alone command (not part of a pipe chain). As with other dplyr verbs, the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

Filtering to quickly review data


This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View the linelist in the viewer panel:

View(linelist)

View specific data for three cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Note: the above command can also be written with dplyr verbs filter() and select() as below:

View(linelist %>%
       filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
       select(date_onset, date_hospitalisation))

Add to pipe chain


# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years"  ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit)      ~ age,
          TRUE                 ~ NA_real_)) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

Numeric categories


CAUTION: This tab builds on cleaning steps shown in previous tabs.

Special approaches for creating numeric categories

Common examples include age categories, groups of lab values, etc.

There are several ways to create categories of a numeric column such as age. Here we will discuss:

  • age_categories(), from the epikit package
  • cut(), from base R
  • using percentiles to break your numbers
  • natural break points… ? TO DO
  • case_when()

Sometimes, numeric variables will import as class "character". This occurs if some values contain non-numeric characters (for example an entry of "2 months" for age), or (depending on your R locale settings) if a comma is used as the decimal separator (e.g. "4,5" meaning four and a half years).
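For example, with a hypothetical "dirty" vector of imported ages:

```r
age_raw <- c("12", "2 months", "4,5")   # hypothetical imported values
class(age_raw)                          # "character"

# as.numeric() converts what it can; un-parseable values become NA (with a warning)
as.numeric(age_raw)

# one fix for decimal commas: replace "," with "." before converting
as.numeric(gsub(",", ".", age_raw, fixed = TRUE))
```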

For this example we will create an age_cat column using the age_years column.

# check the class of the column age_years
class(linelist$age_years)
## [1] "numeric"

age_categories()

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this can be applied to non-age numeric variables too). The output is an ordered factor.

Each specified break value is included in the higher group - that is, groups are closed on their lower/left side and open at the top/right. As shown below, you can add 1 to each break value to achieve groups that include their upper ends.

Other optional arguments:

  • lower = Default is 0. The lowest number you want considered.
  • upper = The highest number you want considered.
  • by = The number of years between groups.
  • separator = Default is “-”. Character between ages in labels.
  • ceiling = Default FALSE. If TRUE, the highest break value is a ceiling and a category “XX+” is not included. Any values above highest break or upper (if defined) are categorized as NA.

See the function’s Help page for more details (enter ?age_categories in the R console).

library(epikit)

# Simple example
################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years,
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1097  1177  1006   855  1108   639    46     0    91
# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70),
                                  upper = max(linelist$age_years, na.rm=T),
                                  ceiling = TRUE)) # 70 is the ceiling
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-70  <NA> 
##  1097  1177  1006   855  1108   639    46    91
# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  upper = max(linelist$age_years, na.rm=T),
                                  breakers = c(0, 6, 11, 16, 21, 31, 51, 71, 76)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-50 51-70 71-75   76+  <NA> 
##  1338  1160   976   809  1029   579    37     0     0    91

cut()


You can use the base R function cut(), which creates categories from a numeric variable. The differences from age_categories() are:

  • You do not need to install/load another package
  • You can specify whether groups are open/closed on the right/left
  • You must provide labels yourself (and ensure they are accurate to the groups)
  • If you want 0 included in the lowest group you must specify this

The basic syntax of cut() is to first provide the numeric column to be cut (age_years), and then the breaks = argument, a numeric vector (c()) of break points. The resulting column is a factor (set ordered_result = TRUE if you want an ordered factor).

If used within mutate() (a dplyr verb) it is not necessary to specify the dataframe before the column name (e.g. linelist$age_years).

Simple cut() example


Create a new column of age categories (age_cat) by cutting the numeric age_years column at specified break points. The example below parallels the first age_categories() example (though note that because cut() includes upper break values by default, the resulting counts differ).

  • Specify numeric vector of break points c(0, 5, 10, 15, ...)
  • Default behavior for cut() is that lower break values are excluded from each category, and upper break values are included. This is the opposite behavior from the age_categories() function.
  • Include 0 in the lowest category by adding include.lowest = TRUE
  • Add a vector of customized labels using the labels = argument
  • Check your work with cross-tabulation of the numeric and category columns - be aware of missing values
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,                                       # numeric column
                        breaks = c(0, 5, 10, 15, 20, 30, 50, 70,        # break points...
                                   max(linelist$age_years, na.rm=T)),   # ... with dynamic last break as column max value
                        right = TRUE,                                   # upper breaks included and lower excluded (a,b]
                        include.lowest = TRUE,                          # 0 included in lowest category
                        labels = c("0-4", "5-9", "10-14", "15-19",      # manual labels - be careful!
                                   "20-29", "30-49", "50-69", "70+")))       

table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1338  1160   976   809  1029   579    37     0    91

cut() details


Below is a detailed description of the behavior of using cut() to make the age_cat column. Key points:

  • Inclusion/exclusion behavior of break points
  • Custom category labels
  • Handling missing values
  • Check your work!

The simplest cut() command, applied to age_years to make the new variable age_cat, is below:

# Create new variable, by cutting the numeric age variable
# by default, the upper break is included and the lower break excluded from each category
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years, breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100)))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    (0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100]     <NA> 
##     1223     1160      976      809     1029      579       37        0      206
  • By default, the categorization occurs so that the right/upper side is “closed” and inclusive (and the left/lower side is “open” and exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but does include B (the upper break value). Reverse this behavior by providing the right = FALSE argument.

  • Thus, by default “0” values are excluded from the lowest group, and categorized as NA. “0” values could be infants coded as age 0. To change this add the argument include.lowest = TRUE. Then, any “0” values are included in the lowest group. The automatically-generated label for the lowest category will change from “(0,B]” to “[0,B]”, which signifies that 0 values are included.

  • Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 15-20).

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
##   0                      0      0       0       0       0       0       0        0  115
##   0.0833333333333333     1      0       0       0       0       0       0        0    0
##   0.166666666666667      1      0       0       0       0       0       0        0    0
##   0.333333333333333      3      0       0       0       0       0       0        0    0
##   0.416666666666667      1      0       0       0       0       0       0        0    0
##   0.5                    2      0       0       0       0       0       0        0    0
##   0.583333333333333      2      0       0       0       0       0       0        0    0
##   0.666666666666667      6      0       0       0       0       0       0        0    0
##   0.75                   1      0       0       0       0       0       0        0    0
##   0.833333333333333      2      0       0       0       0       0       0        0    0
##   1                    259      0       0       0       0       0       0        0    0
##   1.5                    3      0       0       0       0       0       0        0    0
##   2                    250      0       0       0       0       0       0        0    0
##   3                    229      0       0       0       0       0       0        0    0
##   4                    222      0       0       0       0       0       0        0    0
##   5                    241      0       0       0       0       0       0        0    0
##   6                      0    228       0       0       0       0       0        0    0
##   7                      0    231       0       0       0       0       0        0    0
##   8                      0    231       0       0       0       0       0        0    0
##   9                      0    246       0       0       0       0       0        0    0
##   10                     0    224       0       0       0       0       0        0    0
##   11                     0      0     208       0       0       0       0        0    0
##   12                     0      0     209       0       0       0       0        0    0
##   13                     0      0     191       0       0       0       0        0    0
##   14                     0      0     174       0       0       0       0        0    0
##   15                     0      0     194       0       0       0       0        0    0
##   16                     0      0       0     198       0       0       0        0    0
##   17                     0      0       0     179       0       0       0        0    0
##   18                     0      0       0     141       0       0       0        0    0
##   19                     0      0       0     143       0       0       0        0    0
##   20                     0      0       0     148       0       0       0        0    0
##   21                     0      0       0       0     137       0       0        0    0
##   22                     0      0       0       0     129       0       0        0    0
##   23                     0      0       0       0      99       0       0        0    0
##   24                     0      0       0       0     101       0       0        0    0
##   25                     0      0       0       0     108       0       0        0    0
##   26                     0      0       0       0     111       0       0        0    0
##   27                     0      0       0       0      95       0       0        0    0
##   28                     0      0       0       0      97       0       0        0    0
##   29                     0      0       0       0      83       0       0        0    0
##   30                     0      0       0       0      69       0       0        0    0
##   31                     0      0       0       0       0      57       0        0    0
##   32                     0      0       0       0       0      76       0        0    0
##   33                     0      0       0       0       0      71       0        0    0
##   34                     0      0       0       0       0      28       0        0    0
##   35                     0      0       0       0       0      43       0        0    0
##   36                     0      0       0       0       0      46       0        0    0
##   37                     0      0       0       0       0      44       0        0    0
##   38                     0      0       0       0       0      30       0        0    0
##   39                     0      0       0       0       0      20       0        0    0
##   40                     0      0       0       0       0      16       0        0    0
##   41                     0      0       0       0       0      24       0        0    0
##   42                     0      0       0       0       0      30       0        0    0
##   43                     0      0       0       0       0      15       0        0    0
##   44                     0      0       0       0       0      16       0        0    0
##   45                     0      0       0       0       0      17       0        0    0
##   46                     0      0       0       0       0      11       0        0    0
##   47                     0      0       0       0       0      11       0        0    0
##   48                     0      0       0       0       0      12       0        0    0
##   49                     0      0       0       0       0       3       0        0    0
##   50                     0      0       0       0       0       9       0        0    0
##   51                     0      0       0       0       0       0       4        0    0
##   52                     0      0       0       0       0       0       6        0    0
##   53                     0      0       0       0       0       0       3        0    0
##   54                     0      0       0       0       0       0       4        0    0
##   55                     0      0       0       0       0       0       4        0    0
##   56                     0      0       0       0       0       0       6        0    0
##   57                     0      0       0       0       0       0       2        0    0
##   58                     0      0       0       0       0       0       2        0    0
##   59                     0      0       0       0       0       0       1        0    0
##   61                     0      0       0       0       0       0       1        0    0
##   63                     0      0       0       0       0       0       1        0    0
##   65                     0      0       0       0       0       0       1        0    0
##   66                     0      0       0       0       0       0       1        0    0
##   67                     0      0       0       0       0       0       1        0    0
##   <NA>                   0      0       0       0       0       0       0        0   91

Read more about cut() in its Help page by entering ?cut in the R console.
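The boundary behaviors described above can be seen on a small toy vector:

```r
# with default settings, a value of 0 falls outside (0,5] and becomes NA
cut(c(0, 3, 5), breaks = c(0, 5, 10))
# 0 becomes <NA>; 3 and 5 fall in (0,5]

# include.lowest = TRUE closes the first interval, [0,5], so 0 is kept
cut(c(0, 3, 5), breaks = c(0, 5, 10), include.lowest = TRUE)
# all three values fall in [0,5]
```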

Reversing break inclusion behavior in cut()

Lower break values will be included in each category (and upper break values excluded) if the argument right = FALSE is specified. This is applied below - note how the values have shifted among the categories.

NOTE: If you include the include.lowest = TRUE argument together with right = FALSE, include.lowest will apply to the highest break value and category, not the lowest.

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),     # same breaks
                          right = FALSE,                                     # include each *lower* break point            
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")))  # now the labels must change

table(linelist$age_cat, useNA = "always")
## 
##    0-4    5-9  10-14  15-19  20-29  30-49  50-69 70-100   <NA> 
##   1097   1177   1006    855   1108    639     46      0     91

Re-labeling NA values with cut()

Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps, because cut() automatically classifies the new column age_cat as a Factor (a rigid column class that only permits specific values).

First, convert age_cat from Factor to Character class, so you have the flexibility to add new character values (e.g. “Missing”) - otherwise you will encounter an error. Then, use the tidyr function replace_na() to replace NA values with a character value like “Missing”. These steps can be combined into one step, as shown below.

Note that Missing has been added, but the order of the categories is now wrong (alphabetical).

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"))


table(linelist$age_cat, useNA = "always")
## 
##     0-4   10-14   15-19   20-29   30-49     5-9   50-69 Missing    <NA> 
##    1097    1006     855    1108     639    1177      46      91       0

To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"),
         
         # re-classify age_cat as Factor, with correct level order and new "Missing" level
         age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
                                              "30-49", "50-69", "70-100", "Missing")))    
  

table(linelist$age_cat, useNA = "always")
## 
##     0-4     5-9   10-14   15-19   20-29   30-49   50-69  70-100 Missing    <NA> 
##    1097    1177    1006     855    1108     639      46       0      91       0

If you want a fast way to make breaks and labels, you can use something like below (adjust to your specific situation). See the page on using seq() and rep() and c() TO DO

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
# (there must be one fewer label than break points)
age_labels = paste0(head(age_seq, -1) + 1, "-", tail(age_seq, -1))
age_labels

# check that there is one more break point than there are labels
length(age_seq) == length(age_labels) + 1

# # Use them in the cut() command
# cut(linelist$age, breaks = age_seq, labels = age_labels)
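The “percentiles” approach listed earlier can also be sketched with cut(), using quantile() to generate data-driven break points (a minimal sketch; age_quartile is a hypothetical column name, and probs = can be adjusted to your needs):

```r
# breaks at the quartiles of the observed ages
quartile_breaks <- quantile(linelist$age_years,
                            probs = c(0, 0.25, 0.50, 0.75, 1),
                            na.rm = TRUE)

linelist <- linelist %>% 
  mutate(age_quartile = cut(age_years,
                            breaks = quartile_breaks,
                            include.lowest = TRUE))   # keep the minimum value
```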

case_when()


The dplyr function case_when() can also be used to create numeric categories.

  • Allows explicit setting of break point inclusion/exclusion
  • Allows designation of a label for NA values in one step
  • More complicated code, arguably more prone to error
  • Allows more flexibility to include other columns in the logic

If using case_when(), please review the in-depth page on it, as the logic and order of assignment are important to understand in order to avoid errors.

CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (“Missing”, or the special NA_character_ instead of NA).
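A minimal sketch with a toy vector:

```r
ages <- c(4, 25, NA)   # toy values

dplyr::case_when(
  ages < 18  ~ "under 18",
  ages >= 18 ~ "18 and over",
  TRUE       ~ NA_character_)   # character-type NA, matching the other outcomes
# [1] "under 18"    "18 and over" NA
```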

You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. With cut(), the factor conversion and level ordering are done automatically.

linelist <- linelist %>% 
  mutate(age_cat = factor(case_when(
          # provide the case_when logic and outcomes
          age_years >= 0 & age_years < 5     ~ "0-4",          # logic by age_year value
          age_years >= 5 & age_years < 10    ~ "5-9",
          age_years >= 10 & age_years < 15   ~ "10-14",
          age_years >= 15 & age_years < 20   ~ "15-19",
          age_years >= 20 & age_years < 30   ~ "20-29",
          age_years >= 30 & age_years < 50   ~ "30-49",
          age_years >= 50 & age_years < 70   ~ "50-69",
          age_years >= 70 & age_years <= 100 ~ "70-100",
          is.na(age_years)                   ~ "Missing",  # if age_years is missing
          TRUE                               ~ "Check value"   # catch-all alarm to trigger review
          ), levels = c("0-4","5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100", "Missing", "Check value"))
         )


table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49       50-69      70-100     Missing Check value        <NA> 
##        1097        1177        1006         855        1108         639          46           0          91           0           0

Add to pipe chain


Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
  
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

rowwise()

The function rowwise() from dplyr groups the data by individual rows, so that subsequent operations (like sum()) are computed within each row rather than down a column. Read more in the dplyr rowwise vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html

linelist <- linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>%
  ungroup()   # return to a regular (non-rowwise) dataframe
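An equivalent sketch uses c_across(), which selects the symptom columns by name inside the rowwise context (this assumes a recent version of dplyr, where c_across() is available):

```r
library(dplyr)

linelist <- linelist %>%
  rowwise() %>%
  # count how many of the listed symptom columns equal "yes", row by row
  mutate(num_symptoms = sum(c_across(c(fever, chills, cough, aches, vomit)) == "yes")) %>%
  ungroup()
```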

Transforming multiple variables at once

CAUTION: This tab may follow from previous tabs.

A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).

across() can be used with any dplyr verb, but is most commonly used with mutate(), filter(), or summarise(). Here are some examples to get started.

For example, to change all columns to character class:

# change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Change only numeric columns
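To transform only certain columns, across() accepts tidyselect helpers such as where(). A sketch restricted to numeric columns (converting them to character, for illustration):

```r
library(dplyr)

# change only the numeric columns to character class
linelist <- linelist %>%
  mutate(across(where(is.numeric), as.character))
```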

Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale

Deduplication

CAUTION: This tab may follow from previous tabs.

The package dplyr offers the distinct() function, which reduces a dataframe to only its unique rows, removing duplicates.
In this case we want to remove only rows that are complete duplicates, so we add distinct() with no arguments.

For more complex deduplications see the page on deduplicating.

We begin with 6019 rows in linelist.

linelist <- linelist %>% 
  distinct()

After deduplication there are 5889 rows.

Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # create column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
  
    distinct()

Working with Dates

Overview

Working with dates in R is notoriously difficult when compared to other object classes. R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many date formats, some of which can be confused for other formats. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages.

Dates in R are their own class of object - the Date class. It should be noted that there are also classes that store both date and time. Date-time objects are formally of the POSIXt, POSIXct, and/or POSIXlt classes (the difference isn’t important here). These objects are informally referred to as datetime class.

You can get the system date or system datetime by doing the following:

# get the system date - this is a DATE class
Sys.Date()
## [1] "2021-01-31"
# get the system time - this is a DATETIME class
Sys.time()
## [1] "2021-01-31 18:21:31 EST"
  • It is important to make R recognize when a variable contains dates.
  • Dates are an object class and can be tricky to work with.
  • Here we present several ways to convert date variables to Date class.

Packages

The following packages are recommended for working with dates:

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(aweek,      # flexibly converts dates to weeks, and vis-versa
               lubridate,  # for conversions to months, years, etc.
               linelist,   # function to guess messy dates
               ISOweek)    # another option for creating weeks

Converting objects to Date class

The standard, base R function to convert an object or variable to class Date is as.Date() (note capitalization).

as.Date() requires that the user specify the existing format of the date, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().

If used on a variable, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them or consider using guess_dates() from the linelist package.

It can be easiest to first convert the variable to character class, and then convert to date class:

  1. Turn the variable into character values using the function as.character()
linelist_cleaned$date_of_onset <- as.character(linelist_cleaned$date_of_onset)
  2. Convert the variable from character values into date values, using the function as.Date()
    (note the capital “D”)
  • Within the as.Date() function, you must use the format= argument to tell R the current format of the date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.

    • The codes are:
      %d = Day # (of the month e.g. 16, 17, 18…)
      %a = abbreviated weekday (Mon, Tues, Wed, etc.)
      %A = full weekday (Monday, Tuesday, etc.)
      %m = # of month (e.g. 01, 02, 03, 04)
      %b = abbreviated month (Jan, Feb, etc.)
      %B = Full Month (January, February, etc.)
      %y = 2-digit year (e.g. 89)
      %Y = 4-digit year (e.g. 1989)

For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.

linelist_cleaned$date_of_onset <- as.Date(linelist_cleaned$date_of_onset, format = "%d/%m/%Y")

TIP: The format = argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.

TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.

Converting character objects to dates is made far easier by the lubridate package. lubridate is a tidyverse package designed to make working with dates and times simpler and more consistent than base R. For these reasons, lubridate is often considered the gold-standard package for dates and times, and is recommended whenever working with them.

The lubridate package provides a number of different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.

# load packages 
library(lubridate)

# read date in year-month-day format
ymd("2020-10-11")
## [1] "2020-10-11"
ymd("20201011")
## [1] "2020-10-11"
# read date in month-day-year format
mdy("10/11/2020")
## [1] "2020-10-11"
mdy("Oct 11 20")
## [1] "2020-10-11"
# read date in day-month-year format
dmy("11 10 2020")
## [1] "2020-10-11"
dmy("11 October 2020")
## [1] "2020-10-11"

If using piping and the tidyverse, converting a character column to dates might look like this:

linelist_cleaned <- linelist_cleaned %>%
  mutate(date_of_onset = lubridate::dmy(date_of_onset))

Once complete, you can run a command to verify the class of the variable

# Check the class of the variable
class(linelist_cleaned$date_of_onset)  

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
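A quick demonstration of this behavior, using the DD/MM/YYYY example from above:

```r
# parse a DD/MM/YYYY character string into a Date
onset <- as.Date("24/04/1968", format = "%d/%m/%Y")

class(onset)   # "Date"
onset          # printed in the standard format: "1968-04-24"
```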

Converting to datetime classes

As previously mentioned, R also supports a datetime class - a variable that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.

A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied. Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are the same as the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:

# convert datetime with only hours to datetime object
ymd_h("2020-01-01 16hrs")
## [1] "2020-01-01 16:00:00 UTC"
ymd_h("2020-01-01 4PM")
## [1] "2020-01-01 16:00:00 UTC"
# convert datetime with hours and minutes to datetime object
mdy_hm("Jan 1st 2020 16:20")
## [1] "2020-01-01 16:20:00 UTC"
# convert datetime with hours, minutes, and seconds to datetime object
dmy_hms("01 January 20, 16:20:40")
## [1] "2020-01-01 16:20:40 UTC"
# a time zone in the string is ignored unless supplied via the tz = argument
dmy_hms("01 January 20, 16:20:40 PST")
## [1] "2020-01-01 16:20:40 UTC"

When working with a linelist, time and date columns can be combined to create a datetime column using these functions:

# time_admission is a variable in hours:minutes
linelist_cleaned <- linelist_cleaned %>%
  # assume that when time of admission is not given, it is the median admission time
  mutate(
    time_admission_clean = ifelse(
      is.na(time_admission),
      median(time_admission, na.rm = TRUE),
      time_admission
    )
  ) %>%
  # use paste() to combine the two columns into a character vector, and ymd_hm() to convert to datetime
  mutate(
    date_time_of_admission = paste(
      date_hospitalisation, time_admission_clean, sep = " "
    ) %>% ymd_hm()
  )

lubridate

lubridate can also be used for a variety of other tasks, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals.

example_date <- ymd("2020-03-01")

# extract the month and year from this date
month(example_date)
## [1] 3
year(example_date)
## [1] 2020
# get the epiweek of this date (this will be expanded later)
epiweek(example_date)
## [1] 10
# get the day of the week for this date (this will be expanded later)
wday(example_date)
## [1] 1
# add 3 days to this date
example_date + days(3)
## [1] "2020-03-04"
# add 7 weeks and subtract two days from this date
example_date + weeks(7) - days(2)
## [1] "2020-04-17"
# find the interval between this date and Feb 20 2020
example_date - ymd("2020-02-20")
## Time difference of 10 days

This can all be brought together to work with data - for example:

library(lubridate)

linelist_cleaned <- linelist_cleaned %>%
    # convert date of onset from character to date objects by specifying dmy format
    mutate(date_of_onset = dmy(date_of_onset),
           date_of_hospitalisation = dmy(date_of_hospitalisation)) %>%
    # filter out all cases without onset in march
    filter(month(date_of_onset) == 3) %>%
    # find the difference in days between onset and hospitalisation
    mutate(onset_to_hosp_days = date_of_hospitalisation - date_of_onset)

guess_dates()

The function guess_dates() attempts to read a “messy” date variable containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(), which is in the linelist package.

For example: guess_dates() would see the dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.

linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # guess_dates() not yet available on CRAN for R 4.0.2
                                                                  # try install via devtools::install_github("reconhub/linelist")

Some optional arguments for guess_dates() that you might include are:

  • error_tolerance - The proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
  • last_date - the last valid date (defaults to current date)
  • first_date - the first valid date. Defaults to fifty years before the last_date.
# An example using guess_dates on the variable dtdeath
data_cleaned <- data %>% 
  mutate(
    dtdeath = linelist::guess_dates(
      dtdeath, error_tolerance = 0.1, first_date = "2016-01-01"
    )
  )
Excel Dates

Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”, use the as.Date() or lubridate as_date() function to convert, but instead of supplying a format as above, supply the Excel origin date. This will not work if the value is of character class, so be sure it is numeric (or convert it, as the example below does with as.double())!

NOTE: You should provide the origin date in R’s default date format ("YYYY-MM-DD").

library(lubridate)
library(dplyr)

# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>% 
  mutate(date_of_onset = as_date(as.double(date_of_onset), origin = "1899-12-30"))

How dates are displayed

Once dates are the correct class, you often want them displayed differently (e.g. in a plot, graph, or table) - for example as “Monday 05 Jan” instead of 2018-01-05. You can do this with the function format(), which works in a similar way to as.Date(). Read more in this online tutorial. Remember that the output of format() is character class, so it is generally used for display purposes only!

%d = day of the month (e.g. 16, 17, 18…)
%a = abbreviated weekday (Mon, Tues, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = number of month (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = time zone (character)

An example of formatting today’s date:

# today's date, with formatting
format(Sys.Date(), format = "%d %B %Y")
## [1] "31 January 2021"
# easy way to get full date and time (no formatting)
date()
## [1] "Sun Jan 31 18:21:32 2021"
# formatted date, time, and time zone (using paste0() function)
paste0(
  format(Sys.Date(), format = "%A, %b %d '%y, %z  %Z, "), 
  format(Sys.time(), format = "%H:%M:%S")
)
## [1] "Sunday, Jan 31 '21, +0000  UTC, 18:21:32"

Calculating distance between dates

The difference between dates can be calculated by:

  1. Correctly formatting both date variables as class Date (see instructions above)
  2. Creating a new variable defined as one date variable subtracted from the other
  3. Converting the result to numeric class (the default is class “difftime”). This ensures that subsequent mathematical calculations can be performed.
# define variables as date classes
date_of_onset <- ymd("2020-03-16")
date_lab_confirmation <- ymd("2020-03-20")

# find the delay between onset and lab confirmation
days_to_lab_conf <- as.double(date_lab_confirmation - date_of_onset)
days_to_lab_conf
## [1] 4

In a dataframe format (i.e. when working with a linelist), if either of the above dates is missing, the operation will fail for that row. This will result in an NA instead of a numeric value. When using this column for calculations, be sure to set the na.rm option to TRUE. For example:

# add a new column
# calculating the number of days between symptom onset and patient outcome
linelist_delay <- linelist_cleaned %>%
  mutate(
    days_onset_to_outcome = as.double(date_of_outcome - date_of_onset)
  )

# calculate the median number of days to outcome for all cases where data are available
med_days_outcome <- median(linelist_delay$days_onset_to_outcome, na.rm = T)

# often this operation might be done only on a subset of data cases, e.g. those who died
# this is easy to look at and will be explained later in the handbook

Converting dates/time zones

When data are present in different time zones, it can often be important to standardise them in a unified time zone. This can present a further challenge, as in most cases the time zone component of the data must be coded manually.

In R, each datetime object has a time zone component. By default, all datetime objects carry the local time zone of the computer being used - this is generally specific to a location rather than a named time zone, as the time zone in a location often changes due to daylight saving time. It is not possible to accurately compensate for time zones without the time component of a date, as the event a date variable represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot reasonably be accounted for.

To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different one. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones - if the location your data come from is not on this list, nearby large cities in the same time zone serve the same purpose.

# assign the current time to a variable
time_now <- Sys.time()
time_now
## [1] "2021-01-31 18:21:32 EST"
# use with_tz() to assign a new timezone to the variable, while CHANGING the clock time
time_london_real <- with_tz(time_now, "Europe/London")

# use force_tz() to assign a new timezone to the variable, while KEEPING the clock time
time_london_local <- force_tz(time_now, "Europe/London")


# note: as long as the computer running this code is NOT set to London time, the two times will differ (by the number of hours between the computer's time zone and London)

time_london_real - time_london_local
## Time difference of 5 hours

This may seem largely abstract, and is often not needed if the user isn’t working across time zones. One simple example of its implementation is:

# force a datetime column to the time zone of the outbreak location
# "Africa/Lubumbashi" is the time zone for eastern DRC/Nord Kivu
# (datetime_admission is a hypothetical column name, for illustration)
linelist_cleaned <- linelist_cleaned %>%
  mutate(datetime_admission = force_tz(datetime_admission, tzone = "Africa/Lubumbashi"))
Epidemiological weeks

The templates use the very flexible package aweek to set epidemiological weeks. You can read more about it on the RECON website
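A brief sketch of the two central aweek functions, converting between dates and epidemiological weeks (the week_start argument controls which weekday begins the epi week):

```r
library(aweek)

# convert a date to an epidemiological week (weeks starting Monday)
date2week("2020-03-01", week_start = "Monday")

# and back: convert a week string to the date of its first day
week2date("2020-W09-1", week_start = "Monday")
```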

Dates in Epicurves

See the section on epicurves.

Dates miscellaneous

  • Sys.Date() returns the current date of your computer
  • Sys.time() returns the current time of your computer
  • date() returns the current date and time

Missing data

Overview

This page will cover:

  1. Useful functions for assessing missingness
  2. Assess missingness in a dataframe
  3. Plotting missingness over time
  4. Handling how NA is displayed in plots
  5. Imputation

Useful functions

The following are useful functions when assessing or handling missing values:

is.na() and !is.na()

To identify missing values use is.na() or its opposite (with ! in front). Both are from base R.
These return a logical vector (TRUE or FALSE). Remember that you can sum() the resulting vector to count the number of TRUEs, e.g. sum(is.na(linelist$date_outcome)).

my_vector <- c(1, 4, 56, NA, 5, NA, 22)
is.na(my_vector)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
!is.na(my_vector)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
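Counting the missing values in the vector above, via sum() on the logical result:

```r
my_vector <- c(1, 4, 56, NA, 5, NA, 22)

# TRUEs are counted as 1, so the sum is the number of missing values
sum(is.na(my_vector))
## [1] 2
```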

na.omit()

This function, also from base R, removes rows with any missing values when applied to a dataframe, and removes NA values when applied to a vector. For example:

sum(na.omit(my_vector))
## [1] 88
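A small sketch of the dataframe behavior, using a toy dataframe (not from the linelist):

```r
df <- data.frame(x = c(1, 2, NA),
                 y = c("a", NA, "c"))

# only rows with no missing values are kept (here, just the first row)
na.omit(df)
##   x y
## 1 1 a
```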

na.rm = TRUE

Often a mathematical function will by default include NA in calculations, which results in the function returning NA (this is designed intentionally, to make you aware that you have missing data).
You can usually avoid this by removing missing values from the calculation, by including the argument na.rm = TRUE (na.rm stands for “remove NA”).

mean(my_vector)
## [1] NA
mean(my_vector, na.rm = TRUE)
## [1] 17.6

Assess a dataframe

Missingness over time

Below we calculate, by week, the percent of observations with a missing value in the outcome column.

outcome_missing <- linelist %>%
  mutate(week = lubridate::floor_date(date_onset, "week")) %>% 
  group_by(week) %>% 
  summarize(n_obs = n(),
            outcome_missing = sum(is.na(outcome) | outcome == ""), # include "" because this is character
            outcome_p_miss = outcome_missing / n_obs) %>%
  reshape2::melt(id.vars = c("week")) %>%
  filter(grepl("_p_", variable))

Then we plot the proportion missing as a line, by week

ggplot(data = outcome_missing)+
    geom_line(aes(x = week, y = value, group = variable, color = variable), size = 2, stat = "identity")+
    labs(title = "Weekly missingness in 'Outcome'",
         x = "Week",
         y = "Proportion missing") + 
    scale_color_discrete(name = "", labels = c("Weekly proportion of missing outcomes"))+
    scale_y_continuous(breaks = c(seq(0,1,0.1)))+
  theme_minimal()+
  theme(
    legend.position = "bottom"
  )
## Warning: Removed 1 row(s) containing missing values (geom_path).

NA in plots

Imputation

Resources

Grouping/aggregating data

This page reviews how to group and aggregate data for descriptive analysis. It makes use of tidyverse packages for common and easy-to-use functions.

Overview

Before doing descriptive analyses, it will almost always be necessary to first group your data and summarize it across those groups (whether by time period, place, or a relevant categorical variable), since summary statistics are usually more meaningful at the group level. Luckily, the tidyverse makes this easy through the group_by() function.

This page will show how to perform these grouping operations:

  • Fast & easy: the group_by() command in dplyr
  • The base R aggregate() command
  • The .drop = FALSE argument to group_by()

Preparation

For this tab we use the linelist dataset that is cleaned in the Cleaning tab.

Load packages

Ensure tidyverse is installed, which includes dplyr for group_by

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse  # to clean, handle, and plot the data (includes dplyr!)
)

Load the data

linelist <- rio::import(here("data", "linelist_cleaned.xlsx"))

group_by()

You can perform different operations after first grouping by one variable, say, outcome. Grouping instructs R to perform any subsequent calculations within the context of the groups. You can group by one or more columns.

First, let’s convert outcome to a factor to make resulting plots easier to work with.

linelist <- linelist %>%
  mutate(outcome = as.factor(outcome))

Below we will walk through a few examples of group_by functionalities:

tally() gives you a simple count of rows across each category.

count_by_outcome <- linelist %>%
  group_by(outcome) %>%
  tally()

Here we see that there were 2,633 deaths, 2,026 recoveries, and 1,348 with no outcome recorded.

We can easily produce summary tables with a range of descriptive statistics. Adding summarise() after group_by() allows you to specify exactly which summary statistic to compute. Below we find the average age for each outcome group.

Remember to use na.rm = TRUE to exclude the NA values from the calculation of mean age.

avg_age_by_outcome <- linelist %>%
  group_by(outcome) %>%
  summarise(avg_age = mean(age, na.rm=TRUE ))

We see that the average age is roughly stable across outcomes, with those recovering being slightly lower at 14.7 years.

We can also group by more than one variable. You can either list the variables, or use group_by_at() or group_by_if() to choose the grouping columns by specified criteria.

For instance, we can find the number of cases, by gender and month of onset…

count_gender_by_month_of_onset <- linelist %>%
  mutate(month_of_onset = format(date_onset,"%B")) %>%
  group_by(month_of_onset, gender) %>%
  tally()

We could also take the initial record from each group, which can be handy in conjunction with sorting. Below we sort by date_onset and then take the first case for each hospital:

first_record_per_hosp <- linelist %>%
  arrange(date_onset) %>%
  group_by(hospital) %>%
  slice(1)

You can perform any summary function on grouped data; see the Cheat Sheet here for more info: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf

mutate with grouped data

To retain all of the other columns and simply add a new variable for average age, use mutate() instead of summarise(). This can be useful for descriptive statistics where you want the other variables to remain intact.

avg_age_by_outcome_2 <- linelist %>%
  group_by(outcome) %>%
  mutate(avg_age = mean(age, na.rm=TRUE ))

aggregate()
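The base R aggregate() function can produce similar group-level summaries without dplyr. A minimal sketch using the same linelist columns as above (rows with NA in the involved columns are dropped by aggregate()'s formula interface):

```r
# mean age by outcome, using base R
aggregate(age ~ outcome, data = linelist, FUN = mean)

# counts by outcome and gender
aggregate(case_id ~ outcome + gender, data = linelist, FUN = length)
```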

Joining & matching datasets

Overview

This page describes common “joins” and also probabilistic matching between dataframes.

Preparation

Because traditional joins (non-probabilistic) can be very specific, requiring exact string matches, you may need to do cleaning on the datasets prior to the join (e.g. change spellings, change case to all lower or upper).

Datasets

In the joining examples, we’ll use the following datasets:

  1. A “miniature” version of the linelist, containing only the columns case_id, date_onset, and hospital, and only the first 10 rows
  2. A separate dataframe named hosp_info, which contains more details about each hospital

“miniature” linelist

Below is the miniature linelist used for demonstration purposes:

linelist_mini <- linelist %>%                 # start with original linelist
  select(case_id, date_onset, hospital) %>%   # select columns
  head(10)                                    # keep only the first 10 rows

Hospital Information dataframe

Below is the separate dataframe with additional information about each hospital.

Pre-cleaning

Because traditional (non-probabilistic) joins are case-sensitive and require exact string matches, we will clean-up the hosp_info dataset prior to the joins.

Identify differences

We need the values of hosp_name column in hosp_info dataframe to match the values of hospital column in the linelist dataframe.

Here are the values in linelist_mini:

unique(linelist_mini$hospital)
## [1] "Central Hospital"                     "Port Hospital"                        "Other"                               
## [4] "Missing"                              "St. Mark's Maternity Hospital (SMMH)" "Military Hospital"

and here are the values in hosp_info:

unique(hosp_info$hosp_name)
## [1] "central hospital" "military"         "port"             "St. Mark's"       "ignace"           "sisters"

Align matching values

We begin by cleaning the values in hosp_name. Using logic within case_when() (LINK), we re-code the hospital names that exist in both dataframes, and leave the others as they are (see TRUE ~ hosp_name).

CAUTION: Typically, one should create a new column (e.g. hosp_name_clean), but for ease of demonstration we show modification of the old column

hosp_info <- hosp_info %>% 
  mutate(
    hosp_name = case_when(
      hosp_name == "military"          ~ "Military Hospital",
      hosp_name == "port"              ~ "Port Hospital",
      hosp_name == "St. Mark's"        ~ "St. Mark's Maternity Hospital (SMMH)",
      hosp_name == "central hospital"  ~ "Central Hospital",
      TRUE                             ~ hosp_name
      )
    )

We now see that the hospital names that appear in both dataframes are aligned. There are some hospitals in hosp_info that are not present in linelist - we will deal with these later, in the join.

unique(hosp_info$hosp_name)
## [1] "Central Hospital"                     "Military Hospital"                    "Port Hospital"                       
## [4] "St. Mark's Maternity Hospital (SMMH)" "ignace"                               "sisters"

If you need to convert all values to UPPER or lower case, use these functions from stringr, as shown in the page on characters/strings.

str_to_upper()
str_to_lower()
str_to_title()

Joins

dplyr offers several different joins. Below they are described, with some simple use cases. Many thanks to https://github.com/gadenbuie for the animated images!

General syntax

General function structure

Any of these join commands can be run independently, like below.

An object is being created, or re-defined: dataframe 2 is being joined to dataframe 1, on the basis of matches between the “ID” column in df1 and “identifier” column in df2. Because this example uses left_join(), any rows in df2 that do not match to df1 will be dropped.

object <- left_join(df1, df2, by = c("ID" = "identifier"))

The join commands can also be run within a pipe chain. In that case, the first dataframe df1 is the dataframe being passed through the pipes. An example is shown below, with some unimportant mutate() and filter() commands included before the join for context.

object <- df1 %>%
  filter(var1 == 2) %>%        # for demonstration only
  mutate(lag = day + 7) %>%    # for demonstration only
  left_join(df2, by = c("ID" = "identifier"))  # join df2 to df1

Join columns (by =)

You must specify the columns in each dataframe whose values must match, using the argument by =. You have a few options:

  • Specify only one column name (by = "ID") - this only works if this exact column name is present in both dataframes!
  • Specify the different names (by = c("ID" = "Identifier")) - use this if the column names are different in the 2 dataframes
  • Specify multiple columns to match on (by = c("ID" = "Identifier", "date_onset" = "Date_of_Onset")) - this will require exact matches on multiple columns for rows to join.
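As a sketch of the second and third styles, using hypothetical dataframes (df1 and df2 below are invented for illustration, not handbook datasets):

```r
library(dplyr)

# hypothetical example data (not from the handbook datasets)
df1 <- data.frame(ID         = c("a", "b"),
                  date_onset = c("2020-01-01", "2020-01-05"))

df2 <- data.frame(Identifier    = c("a", "b"),
                  Date_of_Onset = c("2020-01-01", "2020-01-07"),
                  lab_result    = c("positive", "negative"))

# different column names: "ID" in df1 is matched to "Identifier" in df2
left_join(df1, df2, by = c("ID" = "Identifier"))

# multiple columns: rows join only if BOTH the ID and the onset date match,
# so here only patient "a" receives a lab_result (patient "b" has differing dates)
left_join(df1, df2, by = c("ID" = "Identifier", "date_onset" = "Date_of_Onset"))
```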

CAUTION: Joins are case-sensitive! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.

Left & right joins

A left or right join is commonly used to add information to a dataframe - new information is added only to rows that already exist in the baseline dataframe.

These are common joins in epidemiological work - they are used to add information from one dataset into another.

The order of the dataframes is important.

  • In a left join, the first (left) dataframe listed is the baseline
  • In a right join, the second (right) dataframe listed is the baseline

All rows of the baseline dataframe are kept. Information in the secondary dataframe is joined to the baseline dataframe only if there is a match via the identifier column(s). In addition:
  • Rows in the secondary dataframe that do not match are dropped.
  • If many baseline rows match to one row in the secondary dataframe (many-to-one), the secondary information is added to each matching baseline row.
  • If a baseline row matches to multiple rows in the secondary dataframe (one-to-many), all combinations are returned, meaning new rows may be added to your returned dataframe!
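A minimal sketch of the one-to-many case, using invented toy dataframes (not handbook data):

```r
library(dplyr)

# toy data: one patient at "Port", and two visits recorded for "Port"
patients <- data.frame(case_id  = c(1, 2, 3),
                       hospital = c("Port", "Central", "Central"))

visits   <- data.frame(hospital   = c("Port", "Port"),
                       visit_date = c("2020-03-01", "2020-03-15"))

# case 1 matches two visit rows (one-to-many), so it appears twice;
# cases 2 and 3 have no match and receive NA - 4 rows are returned in total
left_join(patients, visits, by = "hospital")
```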

Example

Below is the output of a left_join() of hosp_info (secondary dataframe) into linelist_mini (baseline dataframe). Note the following:

  • All original rows of the baseline dataframe linelist_mini are kept
  • One original row of linelist_mini is duplicated (“Military Hospital”) because it matched to two rows in the secondary dataframe, so both combinations are returned
  • The join identifier column of the secondary dataset (hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (hospital)
  • When a baseline row did not match to any secondary row (e.g. when hospital is “Other” or “Missing”), NA fills in the columns from the secondary dataframe
  • Rows in the secondary dataframe with no match to the baseline dataframe (“sisters” and “ignace”) were dropped
linelist_mini %>% 
  left_join(hosp_info, by = c("hospital" = "hosp_name"))

“Should I use a right join, or a left join?”
Most important is to ask “which dataframe should retain all of its rows?” - use this one as the baseline.

The two commands below achieve the same output - hosp_info joined to a linelist_mini baseline. However, the column order will differ based on whether hosp_info arrives from the right (in the left join) or from the left (in the right join). The order of the rows may also differ.

Also consider whether your use-case is within a pipe chain (%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.

# The two commands below achieve the same data, but with differently ordered rows and columns
left_join(linelist_mini, hosp_info, by = c("hospital" = "hosp_name"))
right_join(hosp_info, linelist_mini, by = c("hosp_name" = "hospital"))

Full join

A full join is the most inclusive of the joins - it returns all rows from both dataframes.

If there are rows present in one dataframe but not the other (where no match was found), those rows are still included, and NA values fill the cells with no corresponding data. Watch the number of rows and columns carefully and troubleshoot case-sensitivity and exact string matches.

Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.

Example

Below is the output of a full_join() of hosp_info into linelist_mini. Note the following:

  • All baseline rows (linelist_mini) are kept
  • One baseline row is duplicated (“Military Hospital”) because it matched to two secondary rows and both combinations are returned
  • Only the identifier column from the baseline is kept (hospital)
  • NA fills in where baseline rows did not match to secondary rows (hospital was “Other” or “Missing”), or the opposite (where hosp_name was “ignace” or “sisters”)
linelist_mini %>% 
  full_join(hosp_info, by = c("hospital" = "hosp_name"))

Inner join

An inner join is the most restrictive of the joins - it returns only rows with matches across both dataframes.
This means that your original dataset may reduce in number of rows. Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.

Example

Below is the output of an inner_join() of linelist_mini (baseline) with hosp_info (secondary). Note the following:

  • Not all baseline rows are kept (rows where hospital is “Missing” or “Other” are removed because they had no match in the secondary dataframe)
  • Likewise, secondary rows where hosp_name is “sisters” or “ignace” are removed as they have no match in the baseline dataframe
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))
hosp_info %>% 
  inner_join(linelist_mini, by = c("hosp_name" = "hospital"))

Anti join

The anti join returns rows in dataframe 1 that do not have a match in dataframe 2.

Common scenarios for an anti-join include identifying records not present in another dataframe, troubleshooting spelling in a join (catching records that should have matched), and examining records that were excluded after another join.

As with right_join() and left_join(), the baseline dataframe (listed first) is important. The returned rows come from it only. Notice in the gif below that the row in the non-baseline dataframe (purple row 4) is not returned, even though it does not match.

Simple example

For an example, let’s find the hosp_info hospitals that do not have any cases present in linelist_mini. We list hosp_info first, as the baseline dataframe. The two hospitals which are not present in linelist_mini are returned.

hosp_info %>% 
  anti_join(linelist_mini, by = c("hosp_name" = "hospital"))

Example 2

For another example, let us say we ran an inner_join() between linelist_mini and hosp_info. This returns only 8 of the original 11 linelist_mini records.

linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))

To review the 3 linelist_mini records that were excluded in the inner join, we can run an anti-join with linelist_mini as the baseline dataframe.

linelist_mini %>% 
  anti_join(hosp_info, by = c("hospital" = "hosp_name"))

To see the hosp_info records that were excluded in the inner join, we could also run an anti-join with hosp_info as the baseline dataframe.


Resources

The dplyr page on joins

Characters/strings

Overview

This tab demonstrates use of the stringr package to evaluate and handle character values (“strings”).

  1. Evaluate and subset/extract - str_length(), str_sub(), word()
  2. Combine, order, arrange - str_c(), str_glue(), str_order()
  3. Modify and replace - str_sub(), str_replace_all()
  4. Adjust length - str_pad(), str_trunc(), str_wrap()
  5. Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
  6. Search for patterns - str_detect(), str_subset(), str_match()

For ease of display, most examples are shown acting on a short defined character vector; however, they can easily be applied to a column within a dataframe.

Much of this page is adapted from this online vignette

Preparation

Install or load the packages below, including stringr.

# install or load the stringr package
pacman::p_load(stringr,   # many functions for handling strings
               tidyverse,  # for optional data manipulation
               tools      # alternative for converting to title case
               )

A reference sheet for stringr functions can be found here

Evaluate and subset

Evaluate the length of a string

str_length("abc")
## [1] 3

Alternatively, use nchar() from base R
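For comparison, the base R equivalent gives the same result on a simple string:

```r
# base R alternative to str_length()
nchar("abc")
## [1] 3
```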

Subset/extract string by position

Use str_sub() to return only a part of a string. The function takes three main arguments:

  1. the character vector(s)
  2. start position
  3. end position

A few notes on position numbers:

  • If a position number is positive, the position is counted starting from the left end of the string.
  • If a position number is negative, it is counted starting from the right end of the string.
  • Position numbers are inclusive.
  • Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string “pneumonia”:

# third from left
str_sub("pneumonia", 3, 3)
## [1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
## [1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
## [1] "onia"
# fifth from right, to the first from right
str_sub("pneumonia", -5, -1)
## [1] "monia"
# positions outside the string
str_sub("pneumonia", 4, 15)
## [1] "umonia"

Subset string by word position

To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st-3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
## [1] "I just got"       "My stomach hurts" "Severe ear pain"
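As a sketch with a made-up underscore-separated identifier (not a handbook variable), set sep = "_":

```r
library(stringr)

# extract the 2nd "word" when parts are separated by underscores
word("district_nord_2020", 2, sep = "_")
## [1] "nord"
```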

Combine, order, and arrange

This section covers using str_c(), str_glue(), str_order(), to combine, arrange, and paste together strings.

Combine strings


To combine multiple strings into one string, you can use str_c(), which is the stringr version of c() (concatenate).

str_c("String1", "String2", "String3")
## [1] "String1String2String3"

The argument sep = inserts characters between the input vectors (e.g. a comma or newline "\n").

str_c("String1", "String2", "String3", sep = ", ")
## [1] "String1, String2, String3"

The argument collapse = is relevant if producing multiple elements. The example below shows the combination of first and last names. The sep value goes between each first and last name, while the collapse value goes between the people.

first_names <- c("abdul", "fahruk", "janice") 
last_names  <- c("hussein", "akinleye", "musa")

# sep is between the respective strings, while collapse is between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")
## [1] "abdul hussein;  fahruk akinleye;  janice musa"
# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
## abdul hussein;
## fahruk akinleye;
## janice musa

Gluing strings and other values

str_glue()

You can also combine strings and other pre-defined values and characters with str_glue(). This is a very useful function for creating dynamic plot captions, as demonstrated below.

  • All content goes between quotation marks ("").
  • Any dynamic code or calls to pre-defined objects must be within curly brackets {}. There can be many curly brackets.
  • Within the outer quotation marks, use single quotes if necessary (e.g. when providing date format)
  • You can provide newlines (\n), use format() to display dates, use Sys.Date() to display the current date.
  • If using the %>% pipe operator, ensure the tidyverse package is loaded.

A simple example:

str_glue("The linelist is current to {format(Sys.Date(), '%d %b %Y')} and includes {nrow(linelist)} cases.")
## The linelist is current to 31 Jan 2021 and includes 5889 cases.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the codes are long.

str_glue("Data source is the confirmed case linelist as of {current_date}.\nThe last case was reported hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
## Data source is the confirmed case linelist as of 31 Jan 2021.
## The last case was reported hospitalized on 30 Apr 2015.
## 248 cases are missing date of onset and not shown

Sometimes it is useful to pull data from a dataframe and paste it together in sequence. Below is an example using the case_table dataset to make a summary output of jurisdictions and their new and total cases:

DT::datatable(case_table, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Use str_c() with the dataframe and column names (as in the example above with first & last names). Provide sep and collapse arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")
## [1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

We add the text “New Cases:” to the beginning of the summary by wrapping the command in a separate str_c(). If “New Cases:” were added within the original str_c(), it would appear multiple times.

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))
## [1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

You can achieve a similar result with str_glue(), with newlines added automatically:

str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)")
## Zone 1: 3 new cases (40 total cases)
## Zone 2: 0 new cases (4 total cases)
## Zone 3: 7 new cases (25 total cases)
## Zone 4: 0 new cases (10 total cases)
## Zone 5: 15 new cases (103 total cases)

To use str_glue() but have more control (e.g. to use double newlines), wrap it within str_c() and adjust the collapse value. You may need to print using cat() to correctly print the newlines.

case_summary <- str_c(str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)"), collapse = "\n\n")

cat(case_summary) # print
## Zone 1: 3 new cases (40 total cases)
## 
## Zone 2: 0 new cases (4 total cases)
## 
## Zone 3: 7 new cases (25 total cases)
## 
## Zone 4: 0 new cases (10 total cases)
## 
## Zone 5: 15 new cases (103 total cases)

Sorting

Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)
## [1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
## [1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
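For example, assuming your stringi installation includes the Danish ("da") locale, the letter "å" sorts at the end of the alphabet rather than near "a":

```r
library(stringr)

# default locale ordering places "å" next to "a"
str_sort(c("å", "a", "z"))

# Danish collation places "å" after "z"
str_sort(c("å", "a", "z"), locale = "da")
## [1] "a" "z" "å"
```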

base R functions

It is common to see the base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax differs - the parts (either text or code/pre-defined objects) are separated by commas, for example: paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks."). The sep and collapse arguments can be adjusted. By default sep is a space, unless using paste0(), which puts no space between parts.

Modify and replace

Replace specific character positions

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

word
## [1] "pnXXmonia"

Below is an example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.

words <- c("pneumonia", "tubercolosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words
## [1] "pnXXmonia"    "tuXXrcolosis" "HIXX"

Replace patterns

Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated, then the pattern to be replaced, and then the replacement value. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(outcome, "dead", "deceased")
## [1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

To convert NA values into a character string, use str_replace_na(). The function str_replace() replaces only the first instance of the pattern within each evaluated string.
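A quick illustration of both functions, applied to made-up values:

```r
library(stringr)

# str_replace() changes only the FIRST instance of the pattern per string
str_replace("dead dead", "dead", "deceased")
## [1] "deceased dead"

# str_replace_na() converts NA values into text
str_replace_na(c("dead", NA, "not dead"), replacement = "unknown")
## [1] "dead"     "unknown"  "not dead"
```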

Adjust length

Increase minimum length (pad)

Use str_pad() to add characters to a string, to a minimum length.

By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
## [1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
## [1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 
## [1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")

Truncate/shorten

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
## [1] "Symp...ing"

To ensure each value is the same length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then a very short value is padded to achieve length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
## [1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
## [1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) from the sides of a string.
Add "right", "left", or "both" to the command to specify which side(s) to trim (e.g. str_trim(x, "right")).

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces on the right side only
str_trim(IDs, "right")
## [1] "provA_1852" "provA_2345" "provA_9460"

Remove repeated whitespace within strings

Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 
## [1] "Pt requires IV saline"

Enter ?str_trim, ?str_pad in your R console to see further details.

Wrap lines into paragraphs

Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)
## [1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))
## Symptom onset 1/4/2020 vomiting chills
## fever. Pt saw traditional healer in
## home village on 2/4/2020. On 5/4/2020
## pt symptoms worsened and was admitted
## to Lumta clinic. Sample was taken and pt
## was transported to regional hospital on
## 6/4/2020. Pt died at regional hospital
## on 7/4/2020.

Change case

Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), as shown below:

str_to_upper("California")
## [1] "CALIFORNIA"
str_to_lower("California")
## [1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")
## [1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).

tools::toTitleCase("This is the US state of california")
## [1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")
## [1] "The patient must be transported"

Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect presence/absence of a pattern

Use str_detect() as below. Note that by default the search is case sensitive!

str_detect("primary school teacher", "teach")
## [1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect("primary school teacher", "teach", negate = TRUE)
## [1] FALSE

To ignore case/capitalization, wrap the pattern within regex() and add the argument ignore_case = T inside regex().

str_detect("Teacher", regex("teach", ignore_case = T))
## [1] TRUE

When str_detect() is applied to a character vector/column, it will return a TRUE/FALSE for each of the values in the vector.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count these, apply sum() to the output. This counts the number of TRUE values.

sum(str_detect(occupations, "teach"))
## [1] 1

To search for any of several terms, include them separated by OR bars (|) within the pattern, as shown below:

sum(str_detect(occupations, "teach|professor|tutor"))
## [1] 3

If you need to make a long list of search terms, you can combine them using str_c() with sep = "|", define it as a character object, and reference it later more succinctly. The example below includes possible occupation search terms for frontline medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                               "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                               "cna", "pa", "physician assistant", "mental health",
                               "emergency department technician", "resp therapist", "respiratory",
                                "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                               "rehab", "activity", "elderly", "subacute", "sub acute",
                                "clinic", "post acute", "therapist", "extended care",
                                "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
## [1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(occupations, occupation_med_frontline))
## [1] 2

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
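A brief sketch of these base functions, mirroring the earlier stringr examples:

```r
outcome <- c("Karl: dead", "Samantha: dead", "Marco: not dead")

# grepl() - like str_detect(); ignore.case is a direct argument
grepl("DEAD", outcome, ignore.case = TRUE)
## [1] TRUE TRUE TRUE

# gsub() - like str_replace_all(); replaces every instance of the pattern
gsub("dead", "deceased", outcome)
## [1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"
```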

Detect patterns in conditional logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let’s say the occupations are a column in the linelist called occupations. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic (using str_detect() with negate = TRUE):

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term AND
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    # Must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = T)                          ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                            ~ "Not an educator"))

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")
##      start end
## [1,]     5   6

Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern
##      start end
## [1,]     6   6
## [2,]     3   3
## [3,]     1   1
## [4,]     4   4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
## [[1]]
##      start end
## [1,]     6   6
## 
## [[2]]
##      start end
## [1,]     3   3
## 
## [[3]]
##      start end
## [1,]     1   1
## [2,]     4   4
## 
## [[4]]
##      start end
## [1,]     4   4

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, looking in the string vector of occupations (see previous tab) for either “teach”, “prof”, or “tutor”.

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")
## [[1]]
## character(0)
## 
## [[2]]
## [1] "prof"
## 
## [[3]]
## [1] "teach" "tutor"
## 
## [[4]]
## [1] "tutor"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.

str_extract(occupations, "teach|prof|tutor")
##  [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA      NA

Subset and Count


Aligned functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")
## [1] "university professor"           "primary school teacher & tutor" "tutor"

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
##  [1] 0 1 2 1 0 0 0 0 0 0

Splitting

To split a string based on a pattern, use str_split(). It evaluates the strings and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string, and produces a list with one element - a character vector with three values:

str_split("jaundice, fever, chills", ",")
## [[1]]
## [1] "jaundice" " fever"   " chills"

You can assign the result as a named object, and access the nth symptom with bracket indexing:

pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]
## [1] " fever"

If multiple strings are evaluated, there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
## [[1]]
## [1] "jaundice" " fever"   " chills" 
## 
## [[2]]
## [1] "chills" " aches" " pains"
## 
## [[3]]
## [1] "fever"
## 
## [[4]]
## [1] "vomiting"   " diarrhoea"
## 
## [[5]]
## [1] "bleeding from gums" " fever"            
## 
## [[6]]
## [1] "rapid pulse" " headache"

To access a specific symptom you can use syntax like this: the_split_return_object[[2]][1], which would access the first symptom from the second evaluated string (“chills”). See the R basics page for more detail on accessing elements.

To return a “character matrix” instead, which may be useful if creating dataframe columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = T)
##      [,1]                 [,2]         [,3]     
## [1,] "jaundice"           " fever"     " chills"
## [2,] "chills"             " aches"     " pains" 
## [3,] "fever"              ""           ""       
## [4,] "vomiting"           " diarrhoea" ""       
## [5,] "bleeding from gums" " fever"     ""       
## [6,] "rapid pulse"        " headache"  ""

You can also adjust the number of pieces to create with the n = argument. For example, this restricts the splitting (from the left side) to produce two pieces; any further commas remain within the second piece.

str_split(symptoms, ",", simplify = T, n = 2)
##      [,1]                 [,2]            
## [1,] "jaundice"           " fever, chills"
## [2,] "chills"             " aches, pains" 
## [3,] "fever"              ""              
## [4,] "vomiting"           " diarrhoea"    
## [5,] "bleeding from gums" " fever"        
## [6,] "rapid pulse"        " headache"

Note - the same output can be achieved with str_split_fixed(), which has no simplify argument; instead, you must designate the number of columns with n.

str_split_fixed(symptoms, ",", n = 2)

Splitting a column within a dataframe

Within a dataframe, to split one character column into other columns, use separate() from tidyr.

If we have a simple dataframe df consisting of a case ID column, one character column with symptoms, and one outcome column:
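The dataframe itself is not printed in this copy; a minimal reconstruction, inferred from the outputs in the examples below, might look like this:

```r
df <- data.frame(
  case_ID  = 1:6,
  symptoms = c("jaundice, fever, chills",     # patient 1
               "chills, aches, pains",        # patient 2
               "fever",                       # patient 3
               "vomiting, diarrhoea",         # patient 4
               "bleeding from gums, fever",   # patient 5
               "rapid pulse, headache"),      # patient 6
  outcome  = c("Success", "Failure", "Failure", "Success", "Success", "Success")
)
```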

First provide the column to be separated, then provide a vector (c()) of the new column names to the argument into =, as shown below. The argument sep = can be a character, or a number (interpreted as the character position to split at).

Optional arguments include remove = (TRUE by default; removes the input column) and convert = (FALSE by default; if TRUE, converts the new columns to appropriate classes and turns string "NA"s into true NA).

extra = controls what happens if the separation creates more values than new column names. With the default setting "warn", R returns a warning but proceeds, dropping the excess values. "drop" means the excess values are dropped with no warning.

Setting extra = "merge" will only split to the number of new columns listed in into - this setting preserves all your data.

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1          sym_2 outcome
## 1       1           jaundice  fever, chills Success
## 2       2             chills   aches, pains Failure
## 3       3              fever           <NA> Failure
## 4       4           vomiting      diarrhoea Success
## 5       5 bleeding from gums          fever Success
## 6       6        rapid pulse       headache Success
# each symptom gets its own column; rows with fewer symptoms filled with NA
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2", "sym_3"), sep=",")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 5, 6].
##   case_ID              sym_1      sym_2   sym_3 outcome
## 1       1           jaundice      fever  chills Success
## 2       2             chills      aches   pains Failure
## 3       3              fever       <NA>    <NA> Failure
## 4       4           vomiting  diarrhoea    <NA> Success
## 5       5 bleeding from gums      fever    <NA> Success
## 6       6        rapid pulse   headache    <NA> Success
# third symptoms given their own column
separated <- df %>% 
  separate(symptoms, into = c("sym_1", "sym_2", "sym_3"), sep=",")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 5, 6].
separated
##   case_ID              sym_1      sym_2   sym_3 outcome
## 1       1           jaundice      fever  chills Success
## 2       2             chills      aches   pains Failure
## 3       3              fever       <NA>    <NA> Failure
## 4       4           vomiting  diarrhoea    <NA> Success
## 5       5 bleeding from gums      fever    <NA> Success
## 6       6        rapid pulse   headache    <NA> Success

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.

One solution to automatically make as many columns as needed could be:
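One possible sketch (assumptions: a df with a character symptoms column as described above, and the tidyverse loaded): count the maximum number of comma-separated pieces present, then generate that many column names.

```r
library(dplyr)
library(tidyr)
library(stringr)

# invented sketch dataframe, matching the df described above
df <- data.frame(
  case_ID  = 1:6,
  symptoms = c("jaundice, fever, chills", "chills, aches, pains", "fever",
               "vomiting, diarrhoea", "bleeding from gums, fever",
               "rapid pulse, headache"),
  outcome  = c("Success", "Failure", "Failure", "Success", "Success", "Success")
)

# maximum number of comma-separated pieces in any one row
num_pieces <- max(str_count(df$symptoms, ",")) + 1

df %>%
  separate(symptoms,
           into = paste0("sym_", seq_len(num_pieces)),  # sym_1, sym_2, ... as needed
           sep  = ",",
           fill = "right")   # pad rows with fewer pieces with NA, without a warning
```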

unite()

Within a dataframe, bringing together multiple columns (the opposite of separate()) can be achieved with unite() from tidyr.

Provide the name of the new united column. Then provide the names of the columns you wish to unite. By default the separator used in the united column is "_", but this can be changed with the sep = argument. Other optional arguments include remove = (TRUE by default; removes the input columns from the data frame) and na.rm = (FALSE by default; if TRUE, missing values are removed before uniting).

Below, we re-unite the dataframe that was separated above.

separated %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
##   case_ID               all_symptoms outcome
## 1       1  jaundice,  fever,  chills Success
## 2       2     chills,  aches,  pains Failure
## 3       3                      fever Failure
## 4       4       vomiting,  diarrhoea Success
## 5       5 bleeding from gums,  fever Success
## 6       6     rapid pulse,  headache Success

Regex groups

Groups within strings

str_match() TBD

Regex and special characters

A regular expression, or "regex", is a concise language for describing patterns in strings.

Much of this tab is adapted from this tutorial and this cheatsheet

Special characters

Backslash \ as escape

The backslash \ is used to "escape" the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not "break" the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.

Special characters

Special character Represents
"\\" backslash
"\n" a new line (newline)
"\"" double-quote within double quotes
'\'' single-quote within single quotes
"\`" grave accent
"\r" carriage return
"\t" tab
"\v" vertical tab
"\b" backspace

Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).

Regular expressions (regex)

If you are not familiar with it, a regular expression can look like an alien language.

A regular expression is applied to extract specific patterns from unstructured text - for example medical notes, chief complaints, patient history, or other free text columns in a dataset.

There are four basic tools one can use to create a regular expression:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Groups

Character sets

Character sets are a way of listing, within square brackets, the character options for a match. A match is triggered if any of the characters within the brackets are found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

Character set Matches for
"[A-Z]" any single capital letter
"[a-z]" any single lowercase letter
"[0-9]" any digit
[:alnum:] any alphanumeric character
[:digit:] any numeric digit
[:alpha:] any letter (upper or lowercase)
[:upper:] any uppercase letter
[:lower:] any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
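A brief illustration (example strings invented for this sketch):

```r
library(stringr)

# any vowel
str_extract_all("outbreak", "[aeiou]")       # "o" "u" "e" "a"

# any digit from 0 through 5
str_detect(c("ward 3", "ward 9"), "[0-5]")   # TRUE FALSE
```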

Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character Represents
"\\s" a single space
"\\w" any single alphanumeric character (A-Z, a-z, or 0-9)
"\\d" any single numeric digit (0-9)
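A brief illustration (example string invented for this sketch):

```r
library(stringr)

note <- "Bed 7, Ward C"   # invented example string

str_extract_all(note, "\\d")   # digits:        "7"
str_extract_all(note, "\\w")   # alphanumerics: "B" "e" "d" "7" "W" "a" "r" "d" "C"
str_extract_all(note, "\\s")   # spaces:        " " " "
```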

Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to designate how many consecutive occurrences of a character (or character set) to match.

Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example,

  • "A{2}" will return instances of two capital A letters.
  • "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
  • "A{2,}" will return instances of two or more capital A letters.
  • "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
  • Append an asterisk instead (e.g. "A*") to return zero or more matches (useful if you are not sure the pattern is present)

Using the + plus symbol as a quantifier, the match continues until a different character is encountered. For example, the expression "[A-Za-z]+" will return whole words (uninterrupted runs of alphabetic characters).

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")
## [[1]]
## [1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.

str_extract_all(test, "A{2,4}")
## [[1]]
## [1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")
## [[1]]
## [1] "A"    "AA"   "AAA"  "AAAA"

Relative position

These express requirements for what precedes or follows a pattern ("lookarounds"). For example, the pattern "(?<=\\.)\\s(?=[A-Z])" matches a space that is preceded by a period and followed by a capital letter - the gap between two sentences.

Applied to the test string from the Quantifiers section, the pattern "A(?=-)" extracts only the capital A’s that are immediately followed by a hyphen:

str_extract_all(test, "A(?=-)")
## [[1]]
## [1] "A" "A" "A"
Position statement Matches to
"(?<=b)a" “a” that is preceded by a “b”
"(?<!b)a" “a” that is NOT preceded by a “b”
"a(?=b)" “a” that is followed by a “b”
"a(?!b)" “a” that is NOT followed by a “b”

Groups

Capturing groups in your regular expression is a way to have a more organized output upon extraction.
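As a sketch of what this looks like (the date is taken from the patient note used in the examples below; the pattern itself is illustrative): wrapping parts of a pattern in parentheses creates capture groups, and str_match() returns a matrix with the full match in the first column, then each group in its own column.

```r
library(stringr)

# capture the day, month, and year components of a date written in free text
str_match("arrived on 6/12/2005", "(\\d{1,2})/(\\d{1,2})/(\\d{4})")
##      [,1]        [,2] [,3] [,4]  
## [1,] "6/12/2005" "6"  "12" "2005"
```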

Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches all words - runs of alphabetic characters, each ending when a non-letter character such as a space is encountered:

str_extract_all(pt_note, "[A-Za-z]+")
## [[1]]
##  [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"    "emergency"   "ward"        "at"          "on"         
## [10] "Patient"     "presented"   "with"        "radiating"   "abdominal"   "pain"        "from"        "LR"          "quadrant"   
## [19] "Patient"     "skin"        "was"         "pale"        "cool"        "and"         "clammy"      "Patient"     "temperature"
## [28] "was"         "degrees"     "farinheit"   "Patient"     "pulse"       "rate"        "was"         "bpm"         "and"        
## [37] "thready"     "Respiratory" "rate"        "was"         "per"         "minute"

The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")
## [[1]]
##  [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
Note that an unescaped period "." in regex matches any character, so splitting on "." splits at every single character and returns only empty strings. To split on a literal period, escape it ("\\.") or use fixed("."):

str_split(pt_note, ".")
## [[1]]
##   [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [44] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [87] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [130] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [173] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [216] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [259] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [302] "" "" "" "" "" "" "" ""

This expression matches a phrase of the form: a capitalized word, followed by two more words, then one or two digits, then one or two further words, all separated by spaces.

str_extract_all(pt_note, "[A-Z][a-z]+\\s\\w+\\s\\w+\\s\\d{1,2}\\s\\w+\\s*\\w*")
## [[1]]
## [1] "Respiratory rate was 29 per minute"

You can view a useful list of regex expressions and tips on page 2 of this cheatsheet

Also see this tutorial.

Resources

A reference sheet for stringr functions can be found here

A vignette on stringr can be found here

De-duplication

Overview

This page covers the following subjects:

  1. Identifying and removing duplicate rows
  2. “Slicing” and keeping only certain rows (min, max, random…), also from each group
  3. “Rolling-up”, or combining values from multiple rows into one

Preparation

Load packages

pacman::p_load(tidyverse,   # deduplication, grouping, and slicing functions
               janitor,     # function for reviewing duplicates
               stringr      # for string searches, can be used in "rolling-up" values
               )     

Example dataset

For demonstration, we will use the fake dataset below. It is a record of COVID-19 phone encounters, including with contacts and with cases.

  • The first two records are 100% complete duplicates including duplicate recordID (computer glitch)
  • The second two rows are duplicates, in all columns except for recordID
  • Several people had multiple phone encounters, at various dates/times and as contacts or cases
  • At each encounter, the person was asked if they had ever had symptoms, and some of this information is missing.
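The dataset itself is not printed in this copy. As a rough, invented sketch of its structure (the real dataset has more rows, people, and columns - values here are illustrative only):

```r
library(tibble)

# invented sketch - column names per this page; values illustrative only
obs_sketch <- tribble(
  ~recordID, ~personID, ~name,    ~date,        ~time,   ~encounter, ~purpose,  ~symptoms_ever,
  1,         1,         "adam",   "2021-01-01", "09:35", 1,          "contact", NA,
  1,         1,         "adam",   "2021-01-01", "09:35", 1,          "contact", NA,    # 100% duplicate (glitch)
  2,         2,         "amrish", "2021-01-02", "14:20", 1,          "contact", "Yes",
  3,         2,         "amrish", "2021-01-02", "14:20", 1,          "contact", "Yes"  # duplicate except recordID
)
```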

Deduplication

This tab uses the dataset from the Preparation tab to describe how to review and remove duplicate rows in a dataframe. It also shows how to handle duplicate elements in a vector.

Examine duplicate rows

To quickly review rows that have duplicates, you can use get_dupes() from the janitor package. By default, all columns are considered when duplicates are evaluated - rows returned are 100% duplicates considering the values in all columns.

In the obs dataframe, the first two rows are 100% duplicates - they have the same value in every column (including the recordID column, which is supposed to be unique - it must be some computer glitch). The returned dataframe automatically includes a new column dupe_count, showing the number of rows that share that combination of duplicate values.

# 100% duplicates across all columns
obs %>% 
  janitor::get_dupes()

However, if we choose to ignore recordID, the 3rd and 4th rows are also duplicates. That is, they have the same values in all columns except for recordID. You can specify columns to be ignored in the function using a - minus symbol.

# Duplicates when column recordID is not considered
obs %>% 
  janitor::get_dupes(-recordID)         # if multiple columns, wrap them in c()

You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.

Scroll left for more rows

# duplicates based on name and purpose columns ONLY
obs %>% 
  janitor::get_dupes(name, purpose)

See ?get_dupes for more details, or see this online reference

Keep only unique rows

To keep only unique rows of a dataframe, use distinct() from dplyr. Rows that are duplicates are removed such that only the first of them is kept - by default, "first" means the topmost occurrence (lowest row number, reading top-to-bottom). In the example below, one duplicate row (the first row, for “adam”) has been removed (n is now 18, not 19 rows).

Scroll to the left to see the entire dataframe

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(across(-recordID), # reduces dataframe to only unique rows (keeps first one of any duplicates)
           .keep_all = TRUE) 

# if outside pipes, include the data as first argument 
# distinct(obs)

CAUTION: If using distinct() on grouped data, the function will apply to each group.

Deduplicate based on specific columns

You can also specify columns to be the basis for de-duplication. In this way, the de-duplication only applies to rows that are duplicates within the specified columns. Unless specified with .keep_all = TRUE, all columns not mentioned will be dropped.

In the example below, the de-duplication only applies to rows that have identical values for name and purpose columns. Thus, “brian” has only 2 rows instead of 3 - his first “contact” encounter and his only “case” encounter. To adjust so that brian’s latest encounter of each purpose is kept, see the tab on Slicing within groups.

Scroll to the left to see the entire dataframe

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(name, purpose, .keep_all = TRUE) %>%  # keep rows unique by name and purpose, retain all columns
  arrange(name)                                  # arrange for easier viewing

Duplicate elements in a vector

The function duplicated() from base R will evaluate a vector (column) and return a logical vector of the same length (TRUE/FALSE). The first time a value appears, it will return FALSE (not a duplicate), and subsequent times that value appears it will return TRUE. Note how NA is treated the same as any other value.

x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)
duplicated(x)
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

To return only the duplicated elements, you can use brackets to subset the original vector:

x[duplicated(x)]
## [1]  1 NA  4  4  1  2

To return only the unique elements, use unique() from base R. To remove NAs from the output, nest na.omit() within unique().

unique(x)           # alternatively, use x[!duplicated(x)]
## [1]  1  2 NA  4  5
unique(na.omit(x))  # remove NAs 
## [1] 1 2 4 5

with base R

To return duplicate rows

In base R, you can also see which rows are 100% duplicates in a dataframe df with the command duplicated(df) (returns a logical vector of the rows).

Thus, you can also use the base subset [ ] on the dataframe to see the duplicated rows with df[duplicated(df),] (don’t forget the comma, meaning that you want to see all columns!).

To return unique rows

See the notes above. To see the unique rows you add the logical negator ! in front of the duplicated() function:
df[!duplicated(df),]

To return rows that are duplicates of only certain columns

Subset the df that is within the duplicated() parentheses, so this function will operate on only certain columns of the df.

To specify the columns, provide column numbers or names after a comma (remember, all this is within the duplicated() function).

Be sure to keep the comma , outside after the duplicated() function as well!

For example, to evaluate only columns 2 through 5 for duplicates: df[!duplicated(df[, 2:5]),]
To evaluate only columns name and purpose for duplicates: df[!duplicated(df[, c("name", "purpose")]),]
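Putting these together in a runnable sketch (dataframe values invented for illustration):

```r
# invented example dataframe; rows 1 and 2 are identical
df <- data.frame(
  name    = c("ana", "ana", "bo", "bo"),
  purpose = c("case", "case", "contact", "case"),
  notes   = c("a", "a", "c", "d")
)

df[duplicated(df), ]                              # rows that repeat an earlier row (row 2)
df[!duplicated(df), ]                             # unique rows (rows 1, 3, 4)
df[!duplicated(df[, c("name", "purpose")]), ]     # unique by name and purpose only
```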

Slicing

To “slice” a dataframe is useful in de-duplication if you have multiple rows per functional group (e.g. per “person”) and you only want to analyze one or some of them. Think of slicing as a filter on the rows, selecting by row number/position.

The basic slice() function accepts a number n. If positive, only the nth row is returned. If negative, all rows except the nth are returned.

Variations include:

  • slice_min() and slice_max() - keep only the row(s) with the minimum or maximum value of the specified column. These also work with ordered factors.
  • slice_head() and slice_tail() - keep only the first or last row(s)
  • slice_sample() - keep only a random sample of the rows

Use arguments n = or prop = to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. slice(df, n = 2)). See ?slice for more information.

Other arguments:

order_by = - used in slice_min() and slice_max(), this is the column to order by before slicing.
with_ties = - TRUE by default, meaning ties are kept.
.preserve = - FALSE by default, meaning the grouping structure is re-calculated after slicing. If TRUE, the original grouping is preserved.
weight_by = - used in slice_sample(), an optional numeric column to weight by (bigger numbers more likely to be sampled). Also replace = for whether sampling is done with/without replacement.

TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get an error like “Error: `...` is not empty”.

NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.

Here, the basic slice() function is used to keep only the 4th row:

obs %>% 
  slice(4)  # keeps the 4th row only

Slice with groups

The slice_*() functions can be very useful if applied to a grouped dataframe, as the slice operation is performed on each group separately. Use the function group_by() in conjunction with slice() to group the data and then take a slice from each group.
This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use group_by() with key columns that are the same, and then use a slice function on a column that will differ among the grouped rows.

In the example below, to keep only the latest encounter per person, we group the rows by name and then use slice_max() with n = 1 on the date column. Be aware! To apply a function like slice_max() on dates, the date column must be class Date.

By default, “ties” (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. adam). To avoid this we set with_ties = FALSE. We get back only one row per person.

CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.

DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how Mariah has two encounters on her latest date (6 Jan), and her first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.

obs %>% 
  group_by(name) %>%       # group the rows by 'name'
  slice_max(date,          # keep row per group with maximum date value 
            n = 1,         # keep only the single highest row 
            with_ties = F) # if there's a tie (of date), take the first row

Breaking “ties”

Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.

# Example of multiple slice statements to "break ties"
obs %>%
  group_by(name) %>%
  
  # FIRST - slice by latest date
  slice_max(date, n = 1, with_ties = TRUE) %>% 
  
  # SECOND - if there is a tie, select row with latest time; ties prohibited
  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.

TIP: To use slice_max() or slice_min() on a “character” column, mutate it to an ordered factor class!

Keep all but mark them

If you want to keep all records but mark only some for analysis, consider a two-step approach utilizing a unique recordID/encounter number:

  1. Reduce/slice the original dataframe to only the rows for analysis. Save/retain this reduced dataframe.
  2. In the original dataframe, mark rows as appropriate with case_when(), based on whether their unique record identifier (recordID in this example) is present in the reduced dataframe.

# 1. Define dataframe of rows to keep for analysis
obs_keep <- obs %>%
  group_by(name) %>%
  slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person


# 2. Mark original dataframe
obs_marked <- obs %>%

  # make new dup_record column
  mutate(dup_record = case_when(
    
    # if record is in obs_keep dataframe
    recordID %in% obs_keep$recordID ~ "For analysis", 
    
    # all else marked as "Ignore" for analysis purposes
    TRUE                            ~ "Ignore"))

# print
obs_marked

Calculate row completeness

Create a column that contains a metric for the row’s completeness (non-missingness). This could be helpful when deciding which rows to prioritize over others when de-duplicating/slicing.

In this example, “key” columns over which you want to measure completeness are saved in a vector of column names.

Then the new column key_completeness is created with mutate(). The new value in each row is defined as a calculated fraction: the number of non-missing values in that row among the key columns, divided by the number of key columns.

This involves the function rowSums() from base R. Also used is ., which within piping refers to the dataframe at that point in the pipe (in this case, it is being subset with brackets []).

Scroll to the right to see more rows

# create a "key variable completeness" column
# this is a *proportion* of the columns designated as "key_vars" that have non-missing values

key_cols = c("personID", "name", "symptoms_ever")

obs %>% 
  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 

Roll-up values

This tab describes:

  1. How to “roll-up” values from multiple rows into just one row, with some variations
  2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell

This tab uses the example dataset from the Preparation tab.

Roll-up values into one row

The code example below uses group_by() and summarise() to group rows by person, and then paste together the values from the grouped rows. Thus, you get one summary row per person. A few notes:

  • A suffix can be appended to all new columns ("_roll" in the third variation below)
  • If you want to show only unique values per cell, wrap na.omit() within unique()
  • na.omit() removes NA values; if this is not desired, remove it and use simply paste0(.x, collapse = "; ")

Scroll to the left to see more rows

# "Roll-up" values into one row per group (per "personID") 
cases_rolled <- obs %>% 
  
  # create groups by name
  group_by(personID) %>% 
  
  # order the rows within each group (e.g. by date)
  arrange(date, .by_group = TRUE) %>% 
  
  # For each column, paste together all values within the grouped rows, separated by ";"
  summarise(
    across(everything(),                           # apply to all columns
           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

The result is one row per group (ID), with entries arranged by date and pasted together.

This variation shows unique values only:

# Variation - show unique values only 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                                   # apply to all columns
           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values

This variation appends a suffix to each column.
In this case "_roll" to signify that it has been rolled:

# Variation - suffix added to column names 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                
           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names

Overwrite values/hierarchy

If you then want to evaluate all of the rolled values and keep only a specific one (e.g. the “best” or “maximum” value), you can use mutate() across the desired columns to implement case_when(). Within it, str_detect() from the stringr package sequentially looks for string patterns and overwrites the cell content.

# CLEAN CASES
#############
cases_clean <- cases_rolled %>% 
    
    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
               
               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, the cell value converts to "Yes"
               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, the cell value converts to "No"
               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, the cell value converts to "Unknown"
               TRUE                        ~ as.character(.x)))),   # otherwise, the cell value is kept as is
      .keep = "unused")                                             # old columns removed, leaving only _mod columns

Now you can see in the column symptoms_ever_mod that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.

Resources

Much of the information in this page is adapted from these resources and vignettes online:

datanovia

dplyr tidyverse reference

cran janitor vignette

if/else & ‘for’ loops

Overview

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

if-else

This tab can be renamed. It should demonstrate execution of the task using the recommended package/approach: for example, a package customized for this task where execution is simple and fast but perhaps less customizable, such as using the incidence package to create an epicurve.

‘for’ loops

This tab can be re-named. It should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Tracking progress

for (row_origin in 1:nrow(ct_metrics)){
  # print progress every 100 rows
  if(row_origin %% 100 == 0){
    print(row_origin)
  }
  
  # ... rest of the loop body ...
}

Resources

apply functions

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. It should demonstrate execution of the task using the recommended package/approach: for example, a package customized for this task where execution is simple and fast but perhaps less customizable, such as using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. It should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

IV Analysis

Analysis

These pages are on data analysis!

Descriptive analysis

Overview

This tab demonstrates the use of gtsummary and dplyr to produce descriptive statistics.

  1. Browse data: get a quick overview of your dataset using the skimr package

  2. Summary statistics: mean, median, range, standard deviations, percentiles

  3. Frequency / cross-tabs: counts and proportions

  4. Statistical tests: t-tests, Wilcoxon rank sum, Kruskal-Wallis, and chi-squared

  5. Correlations

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               skimr,        # get overview of data
               tidyverse,    # data management + ggplot2 graphics, 
               gtsummary,    # summary statistics and tests 
               corrr         # correlation analysis for numeric variables
               )

Load data

The example dataset used in this section:

  • Linelist of individual cases from a simulated epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are displayed below.

Clean data

## make sure that age variable is numeric 
linelist <- linelist %>% 
  mutate(age = as.numeric(age))

Browse data

Base R

You can use the summary function to get information about variables and data sets.

For a numeric variable it will give you the minimum, median, mean and max as well as the 1st quartile (= 25th percentile) and the 3rd quartile (= 75th percentile)

## get information about a numeric variable 
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   15.09   22.00   67.00      88

You can also get an overview of each variable in a whole dataset.

## get information about each variable in a dataset 
summary(linelist)
##    case_id            generation    date_infection         date_onset         date_hospitalisation  date_outcome       
##  Length:5889        Min.   : 0.00   Min.   :2012-09-16   Min.   :2014-04-07   Min.   :2012-09-20   Min.   :2012-10-23  
##  Class :character   1st Qu.:13.00   1st Qu.:2014-09-06   1st Qu.:2014-09-16   1st Qu.:2014-09-19   1st Qu.:2014-09-26  
##  Mode  :character   Median :16.00   Median :2014-10-11   Median :2014-10-23   Median :2014-10-23   Median :2014-11-01  
##                     Mean   :16.56   Mean   :2014-10-22   Mean   :2014-11-02   Mean   :2014-11-03   Mean   :2014-11-12  
##                     3rd Qu.:20.00   3rd Qu.:2014-12-05   3rd Qu.:2014-12-19   3rd Qu.:2014-12-17   3rd Qu.:2014-12-28  
##                     Max.   :37.00   Max.   :2015-04-27   Max.   :2015-04-30   Max.   :2015-04-30   Max.   :2015-06-04  
##                                     NA's   :2087         NA's   :248                               NA's   :936         
##    outcome             gender               age          age_unit           age_years        age_cat        age_cat5   
##  Length:5889        Length:5889        Min.   : 0.00   Length:5889        Min.   : 0.00   5-9    :1148   5-9    :1148  
##  Class :character   Class :character   1st Qu.: 6.00   Class :character   1st Qu.: 6.00   20-29  :1091   0-4    :1081  
##  Mode  :character   Mode  :character   Median :13.00   Mode  :character   Median :13.00   0-4    :1081   10-14  : 971  
##                                        Mean   :15.09                      Mean   :15.04   10-14  : 971   15-19  : 837  
##                                        3rd Qu.:22.00                      3rd Qu.:22.00   15-19  : 837   20-24  : 600  
##                                        Max.   :67.00                      Max.   :67.00   (Other): 673   (Other):1164  
##                                        NA's   :88                         NA's   :88      NA's   :  88   NA's   :  88  
##    hospital              lon              lat          infector            source              wt_kg            ht_cm        
##  Length:5889        Min.   :-13.27   Min.   :8.446   Length:5889        Length:5889        Min.   :-15.97   Min.   :  6.783  
##  Class :character   1st Qu.:-13.25   1st Qu.:8.461   Class :character   Class :character   1st Qu.: 41.30   1st Qu.: 90.673  
##  Mode  :character   Median :-13.23   Median :8.469   Mode  :character   Mode  :character   Median : 54.37   Median :127.664  
##                     Mean   :-13.23   Mean   :8.470                                         Mean   : 52.75   Mean   :123.980  
##                     3rd Qu.:-13.22   3rd Qu.:8.480                                         3rd Qu.: 65.73   3rd Qu.:156.981  
##                     Max.   :-13.21   Max.   :8.492                                         Max.   :109.51   Max.   :282.197  
##                                                                                                                              
##     ct_blood        fever              chills             cough              aches              vomit                temp      
##  Min.   :16.00   Length:5889        Length:5889        Length:5889        Length:5889        Length:5889        Min.   :35.28  
##  1st Qu.:20.00   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:38.17  
##  Median :22.00   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :38.83  
##  Mean   :21.19                                                                                                  Mean   :38.56  
##  3rd Qu.:22.00                                                                                                  3rd Qu.:39.24  
##  Max.   :25.00                                                                                                  Max.   :40.76  
##                                                                                                                 NA's   :137    
##  time_admission     days_onset_hosp
##  Length:5889        Min.   : 0.00  
##  Class :character   1st Qu.: 1.00  
##  Mode  :character   Median : 1.00  
##                     Mean   : 2.06  
##                     3rd Qu.: 3.00  
##                     Max.   :22.00  
##                     NA's   :248

skimr package

Using the skimr package you can get a more detailed overview of each of the variables in your dataset.

## get information about each variable in a dataset 
skim(linelist)
Data summary
Name linelist
Number of rows 5889
Number of columns 29
_______________________
Column type frequency:
character 13
Date 4
factor 2
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 0 1.00 6 6 0 5888 0
outcome 1323 0.78 5 7 0 2 0
gender 283 0.95 1 1 0 2 0
age_unit 0 1.00 5 6 0 2 0
hospital 0 1.00 5 36 0 6 0
infector 2088 0.65 6 6 0 2697 0
source 2088 0.65 5 7 0 2 0
fever 237 0.96 2 3 0 2 0
chills 237 0.96 2 3 0 2 0
cough 237 0.96 2 3 0 2 0
aches 237 0.96 2 3 0 2 0
vomit 237 0.96 2 3 0 2 0
time_admission 730 0.88 5 5 0 1081 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_infection 2087 0.65 2012-09-16 2015-04-27 2014-10-11 360
date_onset 248 0.96 2014-04-07 2015-04-30 2014-10-23 365
date_hospitalisation 0 1.00 2012-09-20 2015-04-30 2014-10-23 364
date_outcome 936 0.84 2012-10-23 2015-06-04 2014-11-01 372

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_cat 88 0.99 FALSE 7 5-9: 1148, 20-: 1091, 0-4: 1081, 10-: 971
age_cat5 88 0.99 FALSE 14 5-9: 1148, 0-4: 1081, 10-: 971, 15-: 837

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
generation 0 1.00 16.56 5.79 0.00 13.00 16.00 20.00 37.00 ▁▆▇▂▁
age 88 0.99 15.09 11.23 0.00 6.00 13.00 22.00 67.00 ▇▅▂▁▁
age_years 88 0.99 15.04 11.26 0.00 6.00 13.00 22.00 67.00 ▇▅▂▁▁
lon 0 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21 ▅▃▃▆▇
lat 0 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49 ▅▇▇▇▆
wt_kg 0 1.00 52.75 18.38 -15.97 41.30 54.37 65.73 109.51 ▁▂▇▆▁
ht_cm 0 1.00 123.98 48.90 6.78 90.67 127.66 156.98 282.20 ▂▅▇▃▁
ct_blood 0 1.00 21.19 1.68 16.00 20.00 22.00 22.00 25.00 ▁▃▅▇▁
temp 137 0.98 38.56 0.99 35.28 38.17 38.83 39.24 40.76 ▁▂▂▇▁
days_onset_hosp 248 0.96 2.06 2.27 0.00 1.00 1.00 3.00 22.00 ▇▁▁▁▁

Summary Statistics

gtsummary package

Using gtsummary you can create a table with different summary statistics, for example mean, median, range, standard deviation and percentiles. You can also show these all in one table.

Mean

Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with mean
  tbl_summary(statistic = age ~ "{mean}")
Characteristic N = 5,8891
age 15
Unknown 88

1 Mean

Median

Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with median
  tbl_summary(statistic = age ~ "{median}")
Characteristic N = 5,8891
age 13
Unknown 88

1 Median

Range

The range here is the minimum and maximum values for the variable (see Percentile for the interquartile range). Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with range 
  tbl_summary(statistic = age ~ "{min}, {max}")
Characteristic N = 5,8891
age 0, 67
Unknown 88

1 Range

Standard deviation

Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with standard deviation
  tbl_summary(statistic = age ~ "{sd}")
Characteristic N = 5,8891
age 11
Unknown 88

1 SD

Percentile

To return percentiles you can type in one value that you would like, or you can type in multiple (e.g. to return the interquartile range).

Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with interquartile range 
  tbl_summary(statistic = age ~ "{p25}, {p75}")
Characteristic N = 5,8891
age 6, 22
Unknown 88

1 IQR

Combined table

You can combine all of the previously shown elements in one table by choosing which statistics you want to show. To do this you need to tell the function to return multiple statistics per variable by setting the type to “continuous2”.

Note that this automatically excludes all missing values. If missing values are not excluded, the returned value will be NA (missing). The number of missing values is seen in the Unknown column.

linelist %>% 
  ## only keep variable of interest
  select(age) %>% 
  ## create summary table with multiple statistics 
  tbl_summary(
    ## tell the function you want to get multiple statistics back 
    type = age ~ "continuous2",
    ## define which statistics you want to get back 
    statistic = age ~ c(
    "{mean} ({sd})", 
    "{median} ({p25}, {p75})",
    "{min}, {max}")
    )
Characteristic N = 5,889
age
Mean (SD) 15 (11)
Median (IQR) 13 (6, 22)
Range 0, 67
Unknown 88

dplyr package

You can also use dplyr to create a table with different summary statistics, for example mean, median, range, standard deviation and percentiles, and you can show these all in one table. The difference with dplyr is that the output is not automatically formatted as nicely as with gtsummary.

Mean

Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  ## get the mean value of age while excluding missings
  summarise(mean = mean(age, na.rm = TRUE))
##       mean
## 1 15.09205

Median

Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  ## get the median value of age while excluding missings
  summarise(median = median(age, na.rm = TRUE))
##   median
## 1     13

Range

Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  ## get the range value of age while excluding missings
  summarise(range = range(age, na.rm = TRUE))
##   range
## 1     0
## 2    67

Standard Deviation

Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  ## get the standard deviation of age while excluding missings
  summarise(sd = sd(age, na.rm = TRUE))
##         sd
## 1 11.23393

Percentile

To return percentiles you can type in one value that you would like, or you can type in multiple (e.g. to return the interquartile range).

Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  ## get the default percentile values of age while excluding missings 
  ## these are 0%,  25%,  50%,  75%, 100%
  summarise(percentiles = quantile(age, na.rm = TRUE))
##   percentiles
## 1           0
## 2           6
## 3          13
## 4          22
## 5          67
linelist %>% 
  ## get specified percentile values of age while excluding missings 
  ## these are 5%, 50%,  75%, 98%
  summarise(percentiles = quantile(age,
                                   probs = c(.05, 0.5, 0.75, 0.98), 
                                   na.rm=TRUE))
##   percentiles
## 1           1
## 2          13
## 3          22
## 4          43

Combined table

You can combine all of the previously shown elements in one table by choosing which statistics you want to show. In dplyr you will need to use the str_c() function from stringr to combine the outputs for the IQR and the range into one cell, separated by a comma.

Note the argument na.rm = TRUE, which removes missing values from the calculations. If missing values are not excluded, the returned value will be NA (missing).

linelist %>% 
  summarise(
    ## get the mean 
    mean = mean(age, na.rm = TRUE),
    ## get the standard deviation
    SD = sd(age, na.rm = TRUE),
    ## get the median 
    median = median(age, na.rm = TRUE), 
    ## collapse the IQR separated by a comma
    IQR = str_c(
      quantile(age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      ), 
    ## collapse the range separated by a comma
    Range = str_c(
      range(age, na.rm = TRUE), 
      collapse = ", "
    )
  )
##       mean       SD median   IQR Range
## 1 15.09205 11.23393     13 6, 22 0, 67

Frequency/cross-tabs

gtsummary package

Note that percentages are calculated excluding missing values.

Using gtsummary you can create a table with different counts and proportions for variables with two or more categories, as well as grouping by another variable.

One way table

To produce the counts of a single variable we can use the tbl_summary function. Note that here, the fever variable is yes/no (dichotomous) and tbl_summary automatically only presents the “yes” row. To show all levels you could use the type argument to choose categorical, e.g. tbl_summary(type = fever ~ "categorical").

linelist %>% 
  ## only keep the variable of interest
  select(fever) %>% 
  ## produce summary table
  tbl_summary()
Characteristic N = 5,8891
fever 4,517 (80%)
Unknown 237

1 n (%)
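As a minimal sketch of the type argument mentioned above (assuming the same linelist), this would show both the “yes” and “no” rows:

```r
linelist %>% 
  ## only keep the variable of interest
  select(fever) %>% 
  ## show all levels, not just the "yes" row
  tbl_summary(type = fever ~ "categorical")
```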

Multiple variable one way table

You can also show multiple variables below each other simply by adding them to select.

linelist %>% 
  ## only keep the variables of interest
  select(fever, gender) %>% 
  ## produce summary table
  tbl_summary()
Characteristic N = 5,8891
fever 4,517 (80%)
Unknown 237
gender
f 2,805 (50%)
m 2,801 (50%)
Unknown 283

1 n (%)

Two way table

There are two options to produce a two-way table (i.e. comparing two variables). One option is to use tbl_cross(); however, this function only accepts two variables at a time. The option below with tbl_summary allows more variables.

linelist %>% 
  ## only keep the variables of interest
  select(fever, outcome, gender) %>% 
  ## produce summary table stratified by gender
  tbl_summary(by = gender) %>% 
  ## add a column for the totals
  add_overall()
## 283 observations missing `gender` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `gender` column before passing to `tbl_summary()`.
Characteristic Overall, N = 5,6061 f, N = 2,8051 m, N = 2,8011
fever 4,293 (80%) 2,145 (80%) 2,148 (80%)
Unknown 229 107 122
outcome
Death 2,460 (56%) 1,233 (57%) 1,227 (56%)
Recover 1,901 (44%) 949 (43%) 952 (44%)
Unknown 1,245 623 622

1 n (%)

Three way table

Producing counts based on three variables (adding a stratifier).

## TODO: add stratified tables when available 

# table_3vars <- table(linelist$fever, linelist$gender, linelist$outcome)
# 
# ftable(table_3vars)

dplyr package

Creating cross tabulations with dplyr is less straightforward, as this does not fit within the tidyverse dataset structure. It is still useful to demonstrate, though, as the data produced can be used for plotting (see the ggplot section). Another option is to use the tabyl() function from the janitor package.
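As a brief sketch of the janitor alternative (assuming the same linelist, with janitor installed):

```r
linelist %>% 
  ## counts and proportions of fever, with NA shown as its own row
  janitor::tabyl(fever) %>% 
  ## format the proportions as percentages
  janitor::adorn_pct_formatting()
```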

One way table

Producing counts and proportions for a single variable. To see how to do this for multiple variables, see the for-loop section.

linelist %>% 
  ## count the variable of interest
  count(fever) %>% 
  ## calculate proportion 
  mutate(percentage = n / sum(n) * 100)
##   fever    n percentage
## 1    no 1135  19.273221
## 2   yes 4517  76.702326
## 3  <NA>  237   4.024452

Two way table

Producing counts and proportions based on a grouping variable. Here we use the dplyr group_by() function; for more information see the grouping and aggregating section. You can calculate percentages of the overall total by using ungroup() after count(...).

Note that it is possible to change the below table to wide format, making it more like a two-by-two (cross tabulation), using the tidyr pivot_wider() function. This would be done by adding this to the end of the code below: pivot_wider(names_from = gender, values_from = c(n, percentage)). For more information see the pivot section.

linelist %>% 
  ## do everything by gender 
  group_by(gender) %>% 
  ## count the variable of interest
  count(fever) %>% 
  ## calculate proportion 
  ## note that the denominator here is the sum of each gender
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 9 x 4
## # Groups:   gender [3]
##   gender fever     n percentage
##   <chr>  <chr> <int>      <dbl>
## 1 f      no      553      19.7 
## 2 f      yes    2145      76.5 
## 3 f      <NA>    107       3.81
## 4 m      no      531      19.0 
## 5 m      yes    2148      76.7 
## 6 m      <NA>    122       4.36
## 7 <NA>   no       51      18.0 
## 8 <NA>   yes     224      79.2 
## 9 <NA>   <NA>      8       2.83
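For illustration, the wide-format version described above would look like this (a sketch appending pivot_wider() to the same code):

```r
linelist %>% 
  ## do everything by gender 
  group_by(gender) %>% 
  ## count the variable of interest
  count(fever) %>% 
  ## calculate proportion within each gender
  mutate(percentage = n / sum(n) * 100) %>% 
  ## spread to wide format: one row per fever level, columns per gender
  pivot_wider(names_from = gender, values_from = c(n, percentage))
```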

Three way table

Producing counts based on three variables (adding a stratifier).

linelist %>% 
  ## do everything by gender and outcome 
  group_by(gender, outcome) %>% 
  ## count the variable of interest 
  count(fever) %>% 
  ## calculate the proportion
  ## note that the denominator here is the sum of each group combination
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 27 x 5
## # Groups:   gender, outcome [9]
##    gender outcome fever     n percentage
##    <chr>  <chr>   <chr> <int>      <dbl>
##  1 f      Death   no      239      19.4 
##  2 f      Death   yes     941      76.3 
##  3 f      Death   <NA>     53       4.30
##  4 f      Recover no      188      19.8 
##  5 f      Recover yes     725      76.4 
##  6 f      Recover <NA>     36       3.79
##  7 f      <NA>    no      126      20.2 
##  8 f      <NA>    yes     479      76.9 
##  9 f      <NA>    <NA>     18       2.89
## 10 m      Death   no      235      19.2 
## # ... with 17 more rows

Statistical tests

gtsummary package

Performing statistical tests of comparison with tbl_summary is done by adding the add_p() function and specifying which test to use. It is possible to get p-values corrected for multiple testing by also using the add_q() function.
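For example, a sketch of correcting for multiple testing with add_q() (assuming the same linelist; the adjustment method defaults to the false discovery rate):

```r
linelist %>% 
  ## only keep variables of interest
  select(gender, outcome) %>% 
  tbl_summary(by = outcome) %>% 
  ## add p-values from the default tests
  add_p() %>% 
  ## add q-values adjusted for multiple testing
  add_q()
```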

T-tests

Compare the difference in means for a continuous variable in two groups. For example compare the mean age by patient outcome.

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## produce summary table
  tbl_summary(
    ## specify what statistic want to show
    statistic = age ~ "{mean} ({sd})", 
    ## specify the grouping variable
    by = outcome) %>% 
  ## specify what test want to perform
  add_p(age ~ "t.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9841 p-value2
age 15 (11) 15 (11) 0.4
Unknown 34 27

1 Mean (SD)

2 Welch Two Sample t-test

Wilcoxon rank sum test

Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However, for non-normally distributed data or comparing multiple groups, the Kruskal-Wallis test is more appropriate.

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## produce summary table
  tbl_summary(
    ## specify what statistic want to show (default so could remove)
    statistic = age ~ "{median} ({p25}, {p75})", 
    ## specify the grouping variable
    by = outcome) %>% 
  ## specify what test want to perform (default so could leave brackets empty)
  add_p(age ~ "wilcox.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9841 p-value2
age 13 (6, 22) 13 (6, 21) 0.6
Unknown 34 27

1 Median (IQR)

2 Wilcoxon rank sum test

Kruskal-Wallis test

Compare the distribution of a continuous variable in two or more groups, regardless of whether the data is normally distributed.

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## produce summary table
  tbl_summary(
    ## specify what statistic want to show (default so could remove)
    statistic = age ~ "{median} ({p25}, {p75})", 
    ## specify the grouping variable
    by = outcome) %>% 
  ## specify what test want to perform
  add_p(age ~ "kruskal.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9841 p-value2
age 13 (6, 22) 13 (6, 21) 0.6
Unknown 34 27

1 Median (IQR)

2 Kruskal-Wallis rank sum test

Chi-squared test

Compare the proportions of a categorical variable in two groups. The default is to perform a chi-squared test of independence with continuity correction, but if any expected cell count is below 5 then a Fisher’s exact test is used.

linelist %>% 
  ## only keep variables of interest
  select(gender, outcome) %>% 
  ## produce summary table
  tbl_summary(
    ## specify the grouping variable
    by = outcome
  ) %>% 
  ## specify what test want to perform
  add_p()
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic Death, N = 2,5821 Recover, N = 1,9841 p-value2
gender 0.9
f 1,233 (50%) 949 (50%)
m 1,227 (50%) 952 (50%)
Unknown 122 83

1 n (%)

2 Pearson's Chi-squared test

dplyr package

Performing statistical tests in dplyr alone is very dense, again because it does not fit within the tidy-data framework. It requires using purrr to create a list of dataframes for each of the subgroups you want to compare. An easier alternative may be the rstatix package.
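As a brief sketch of the rstatix alternative (assuming the package is installed; not demonstrated elsewhere on this page):

```r
## compare mean age by outcome with a t-test,
## returned as a tidy one-row dataframe rather than a list
linelist %>% 
  rstatix::t_test(age ~ outcome)
```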

T-tests

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread in to wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the mean age for the death group
    Death_mean = map(Death, ~mean(.x$age, na.rm = TRUE)),
    ## calculate the sd among dead 
    Death_sd = map(Death, ~sd(.x$age, na.rm = TRUE)),
    ## calculate the mean age for the recover group
    Recover_mean = map(Recover, ~mean(.x$age, na.rm = TRUE)), 
    ## calculate the sd among recovered 
    Recover_sd = map(Recover, ~sd(.x$age, na.rm = TRUE)),
    ## using both grouped data sets compare mean age with a t-test
    ## keep only the p.value
    t_test = map2(Death, Recover, ~t.test(.x$age, .y$age)$p.value)
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_mean Death_sd Recover_mean Recover_sd t_test
##        <dbl>    <dbl>        <dbl>      <dbl>  <dbl>
## 1       15.1     11.3         14.8       11.0  0.445

Wilcoxon rank sum test

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread in to wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the median age for the death group
    Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
    ## calculate the IQR among dead 
    Death_iqr = map(Death, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## calculate the median age for the recover group
    Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)), 
    ## calculate the IQR among recovered 
    Recover_iqr = map(Recover, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## using both grouped data sets compare age distribution with a wilcox test
    ## keep only the p.value
    wilcox = map2(Death, Recover, ~wilcox.test(.x$age, .y$age)$p.value)
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_median Death_iqr Recover_median Recover_iqr wilcox
##          <dbl> <chr>              <dbl> <chr>        <dbl>
## 1           13 6, 22                 13 6, 21        0.608

Kruskal-Wallis test

linelist %>% 
  ## only keep variables of interest
  select(age, outcome) %>% 
  ## drop those missing outcome 
  filter(!is.na(outcome)) %>% 
  ## specify the grouping variable
  group_by(outcome) %>% 
  ## create a subset of data for each group (as a list)
  nest() %>% 
  ## spread in to wide format
  pivot_wider(names_from = outcome, values_from = data) %>% 
  mutate(
    ## calculate the median age for the death group
    Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
    ## calculate the IQR among dead 
    Death_iqr = map(Death, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## calculate the median age for the recover group
    Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)), 
    ## calculate the IQR among recovered 
    Recover_iqr = map(Recover, ~str_c(
      quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE), 
      collapse = ", "
      )),
    ## using the original data set compare age distribution with a kruskal test
    ## keep only the p.value
    kruskal = kruskal.test(linelist$age, linelist$outcome)$p.value
  ) %>% 
  ## drop datasets 
  select(-Death, -Recover) %>% 
  ## return a dataset with the medians and p.value (drop missing)
  unnest(cols = everything())
## # A tibble: 1 x 5
##   Death_median Death_iqr Recover_median Recover_iqr kruskal
##          <dbl> <chr>              <dbl> <chr>         <dbl>
## 1           13 6, 22                 13 6, 21         0.608

Chi-squared test

linelist %>% 
  ## do everything by outcome 
  group_by(outcome) %>% 
  ## count the variable of interest
  count(gender) %>% 
  ## calculate proportion 
  ## note that the denominator is the total within each outcome group
  mutate(percentage = n / sum(n) * 100) %>% 
  pivot_wider(names_from = outcome, values_from = c(n, percentage)) %>% 
  filter(!is.na(gender)) %>% 
  mutate(pval = chisq.test(linelist$gender, linelist$outcome)$p.value)
## # A tibble: 2 x 8
##   gender n_Death n_Recover  n_NA percentage_Death percentage_Recover percentage_NA  pval
##   <chr>    <int>     <int> <int>            <dbl>              <dbl>         <dbl> <dbl>
## 1 f         1233       949   623             47.8               47.8          47.1 0.920
## 2 m         1227       952   622             47.5               48.0          47.0 0.920

base package

You can also use base R functions to produce the results of statistical tests. However, the outputs of these functions are usually lists, and so are harder to manipulate.
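For example, the object returned by t.test() is a list of class "htest", and individual elements can be pulled out by name. A quick illustration using R's built-in sleep dataset (a stand-in here, not the linelist):

```r
## run a t-test on R's built-in 'sleep' dataset
tt <- t.test(extra ~ group, data = sleep)

## the result is a list of class "htest"; see its element names
names(tt)

## pull out individual components with $
tt$p.value    # the p-value alone
tt$estimate   # the group means
tt$conf.int   # the confidence interval
```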

T-tests

## compare mean age by outcome group with a t-test
t.test(age ~ outcome, data = linelist)
## 
##  Welch Two Sample t-test
## 
## data:  age by outcome
## t = 0.76363, df = 4261.3, p-value = 0.4451
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4007016  0.9120010
## sample estimates:
##   mean in group Death mean in group Recover 
##              15.07732              14.82167

Wilcoxon rank sum test

## compare age distribution by outcome group with a wilcox test
wilcox.test(age ~ outcome, data = linelist)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by outcome
## W = 2515431, p-value = 0.6075
## alternative hypothesis: true location shift is not equal to 0

Kruskal-Wallis test

## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age ~ outcome, linelist)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  age by outcome
## Kruskal-Wallis chi-squared = 0.26378, df = 1, p-value = 0.6075

Chi-squared test

## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  linelist$gender and linelist$outcome
## X-squared = 0.010203, df = 1, p-value = 0.9195

Correlations

Correlation between numeric variables can be investigated using the tidyverse corrr package. It allows you to compute correlations using the Pearson, Kendall tau, or Spearman rho methods. The package creates a table and also has a function to automatically plot the values.
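For a quick check without corrr, base R's cor() computes the same coefficients (corrr builds its table on top of this). A minimal sketch with made-up values:

```r
## base R alternative: cor() on a small made-up data frame
vals <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 5)
)

## full (mirrored) correlation matrix, with pairwise-complete
## observations as in corrr's default
cor(vals, method = "pearson", use = "pairwise.complete.obs")
```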

correlation_tab <- linelist %>% 
  ## pick the numeric variables of interest
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>% 
  ## create correlation table (using default pearson)
  correlate()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## remove duplicate entries (the table is mirrored) 
correlation_tab <- correlation_tab %>% 
  shave()


## view correlation table 
correlation_tab
## # A tibble: 6 x 7
##   term            generation     age ct_blood days_onset_hosp  wt_kg ht_cm
##   <chr>                <dbl>   <dbl>    <dbl>           <dbl>  <dbl> <dbl>
## 1 generation         NA      NA      NA               NA      NA        NA
## 2 age                -0.0144 NA      NA               NA      NA        NA
## 3 ct_blood            0.184  -0.0108 NA               NA      NA        NA
## 4 days_onset_hosp    -0.289  -0.0147 -0.600           NA      NA        NA
## 5 wt_kg              -0.0153  0.840  -0.00763         -0.0153 NA        NA
## 6 ht_cm              -0.0150  0.882  -0.00907         -0.0121  0.885    NA
## plot correlations 
rplot(correlation_tab)
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.

Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary dplyr corrr sthda correlation

Univariate and multivariable regression

Overview

This page demonstrates the use of the gtsummary package and regression functions to look at associations between variables (e.g. odds ratios, risk ratios and hazard ratios):

  1. Univariate: two-by-two tables
  2. Stratified: Mantel-Haenszel estimates
  3. Multivariable: variable selection, model selection, final table
  4. Forest plot

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics, 
               stringr,      # manipulate text strings 
               purrr,        # loop over objects in a tidy way
               gtsummary,    # summary statistics and tests 
               broom,        # tidy up results from regressions
               parameters,   # alternative to tidy up results from regressions
               see           # easystats package to visualise model outputs
               )

Load data

The example dataset used in this section:

  • Linelist of individual cases from a simulated epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are displayed below.

Clean data

## make sure that age variable is numeric 
linelist <- linelist %>% 
  mutate(age = as.numeric(age))

## define variables of interest 
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

## make dichotomous variables in to 0/1 
linelist <- linelist %>% 
  mutate(
    ## for each of the variables listed
    across(
      all_of(c(explanatory_vars, "outcome")), 
      ## recode male, yes and death to 1; female, no and recover to 0
      ## otherwise set to missing
           ~case_when(
             . %in% c("m", "yes", "Death")   ~ 1,
             . %in% c("f", "no",  "Recover") ~ 0, 
             TRUE ~ NA_real_
           ))
  )

## add in age_category to the explanatory vars 
explanatory_vars <- c(explanatory_vars, "age_cat")

## drop rows with missing information for variables of interest 
linelist <- linelist %>% 
  drop_na(any_of(c("outcome", explanatory_vars)))

Univariate

There are two options for doing univariate analysis. You can use the gtsummary package, or you can use the individual regression functions available in base R together with the broom package.

gtsummary package

univ_tab <- linelist %>% 
  ## select variables of interest
  select(explanatory_vars, outcome) %>% 
  ## produce univariate table
  tbl_uvregression(
    ## define the regression to run (generalised linear model)
    method = glm, 
    ## define outcome variable
    y = outcome, 
    ## define the type of glm to run (logistic)
    method.args = list(family = binomial), 
    ## exponentiate the outputs to produce odds ratios (rather than log odds)
    exponentiate = TRUE
    )

## view univariate results table 
univ_tab
Characteristic N OR1 95% CI1 p-value
gender 4,172 0.99 0.88, 1.12 >0.9
fever 4,172 1.03 0.89, 1.20 0.7
chills 4,172 1.02 0.88, 1.19 0.8
cough 4,172 1.01 0.85, 1.20 >0.9
aches 4,172 0.89 0.72, 1.08 0.2
vomit 4,172 0.99 0.87, 1.12 0.8
age_cat 4,172
0-4
5-9 1.04 0.86, 1.27 0.7
10-14 1.03 0.84, 1.27 0.8
15-19 1.09 0.88, 1.35 0.4
20-29 1.02 0.84, 1.25 0.8
30-49 1.03 0.81, 1.30 0.8
50-69 1.47 0.73, 3.12 0.3

1 OR = Odds Ratio, CI = Confidence Interval

base

Using the glm function from the stats package (part of base R), you can produce odds ratios.
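As a minimal sketch with simulated data (not the linelist), odds ratios and Wald confidence intervals can be recovered from a glm with only base functions:

```r
## simulate a small dataset: binary outcome and binary exposure
set.seed(1)
sim <- data.frame(
  outcome  = rbinom(100, 1, 0.5),
  exposure = rbinom(100, 1, 0.4)
)

## logistic regression
mod <- glm(outcome ~ exposure, family = "binomial", data = sim)

## exponentiate the log-odds to get odds ratios
exp(coef(mod))

## Wald confidence intervals on the odds ratio scale
exp(confint.default(mod))
```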

For a single exposure variable, pass the names to glm and then use tidy from the broom package to get the exponentiated odds ratio estimates and confidence intervals. Here we demonstrate how to combine model outputs with a table of counts.

model <- glm(
  ## define the variables of interest
  outcome ~ age_cat, 
  ## define the type of regression (logistic)
  family = "binomial", 
  ## define your dataset
  data = linelist) %>% 
  ## clean up the outputs of the regression (exponentiate and produce CIs)
  tidy(
      exponentiate = TRUE, 
      conf.int = TRUE)


linelist %>% 
  ## get counts of variable of interest grouped by outcome
  group_by(outcome) %>% 
  count(age_cat) %>% 
  ## spread to wide format (as in cross-tabulation)
  pivot_wider(names_from = outcome, values_from = n) %>% 
  ## drop rows with missings
  filter(!is.na(age_cat)) %>% 
  ## merge with the outputs of the regression 
  bind_cols(., model) %>% 
  ## only keep columns interested in 
  select(term, 2:3, estimate, conf.low, conf.high, p.value)
## # A tibble: 7 x 7
##   term           `0`   `1` estimate conf.low conf.high p.value
##   <chr>        <int> <int>    <dbl>    <dbl>     <dbl>   <dbl>
## 1 (Intercept)    354   440     1.24    1.08       1.43 0.00232
## 2 age_cat5-9     357   462     1.04    0.855      1.27 0.688  
## 3 age_cat10-14   306   393     1.03    0.842      1.27 0.754  
## 4 age_cat15-19   259   352     1.09    0.884      1.35 0.411  
## 5 age_cat20-29   346   439     1.02    0.837      1.25 0.839  
## 6 age_cat30-49   189   241     1.03    0.810      1.30 0.832  
## 7 age_cat50-69    12    22     1.47    0.732      3.12 0.288

To run over several exposure variables to produce univariate odds ratios (i.e. not controlling for each other), you can pass a vector of variable names to the map() function from the purrr package. This will loop over the variables, running a regression for each one.

models <- explanatory_vars %>% 
  ## combine each name of the variables of interest with the name of outcome variable
  str_c("outcome ~ ", .) %>% 
  ## for each string above ("outcome ~ variable of interest")
  map(
    ## run a general linear model 
    ~glm(
      ## define formula as each of the strings above
      as.formula(.x), 
      ## define type of glm (logistic)
      family = "binomial", 
      ## define your dataset
      data = linelist)
  ) %>% 
  ## for each of the output regressions from above 
  map(
    ## tidy the output
    ~tidy(
      ## each of the regressions 
      .x, 
      ## exponentiate and produce CIs
      exponentiate = TRUE, 
      conf.int = TRUE)
  ) %>% 
  ## collapse the list of regression outputs into one data frame
  bind_rows()



## for each explanatory variable
univ_tab_base <- map(explanatory_vars, 
      ~{linelist %>% 
          ## group data set by outcome
          group_by(outcome) %>% 
          ## produce counts for variable of interest
          count(.data[[.x]]) %>% 
          ## spread to wide format (as in cross-tabulation)
          pivot_wider(names_from = outcome, values_from = n) %>% 
          ## drop rows with missings
          filter(!is.na(.data[[.x]])) %>% 
          ## change the variable of interest column to be called "variable"
          rename("variable" = .x) %>% 
          ## change the variable of interest column to be a character 
          ## otherwise non-dichotomous (categorical) variables come out as factor and can't be merged
          mutate(variable = as.character(variable))
                 }
      ) %>% 
  ## collapse the list of count outputs into one data frame
  bind_rows() %>% 
  ## merge with the outputs of the regression 
  bind_cols(., models) %>% 
  ## only keep columns interested in 
  select(term, 2:3, estimate, conf.low, conf.high, p.value)

Stratified

Stratified analysis is still being worked on for gtsummary; this page will be updated in due course.

gtsummary package

TODO

base

TODO
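In the meantime, a base R sketch: the stats package provides mantelhaen.test(), which takes a 2x2xK table (exposure x outcome x stratum) and returns a common odds ratio estimate across strata. The counts below are invented for illustration only.

```r
## hypothetical counts arranged as exposure x outcome x stratum
strat_tab <- array(
  c(10, 20, 30, 40,    # stratum A
    15, 25, 35, 45),   # stratum B
  dim = c(2, 2, 2),
  dimnames = list(
    exposure = c("yes", "no"),
    outcome  = c("death", "recover"),
    stratum  = c("A", "B")
  )
)

## Mantel-Haenszel common odds ratio across strata
mantelhaen.test(strat_tab)
```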

Multivariable

For multivariable analysis, there is not much difference between using gtsummary and broom to present the data. The workflow is the same for both, as below; only the last step of pulling the table together differs.

## run a regression with all variables of interest 
mv_reg <- explanatory_vars %>% 
  ## combine all names of the variables of interest separated by a plus
  str_c(collapse = "+") %>% 
  ## combined the names of variables of interest with outcome in formula style
  str_c("outcome ~ ", .) %>% 
  glm(## define type of glm (logistic)
      family = "binomial", 
      ## define your dataset
      data = linelist) 

## choose a model using forward selection based on AIC
## you can also do "backward" or "both" by adjusting the direction
final_mv_reg <- mv_reg %>%
  step(direction = "forward", trace = FALSE)

gtsummary package

The gtsummary package provides the tbl_regression function, which takes the outputs from a regression (glm in this case) and produces an easy summary table. You can also combine several different output tables produced by gtsummary with the tbl_merge function.

## show results table of final regression 
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)

## combine with univariate results 
tbl_merge(
  tbls = list(univ_tab, mv_tab), 
  tab_spanner = c("**Univariate**", "**Multivariable**"))
Characteristic Univariate Multivariable
N OR1 95% CI1 p-value OR1 95% CI1 p-value
gender 4,172 0.99 0.88, 1.12 >0.9 0.99 0.87, 1.12 0.9
fever 4,172 1.03 0.89, 1.20 0.7 1.03 0.89, 1.20 0.7
chills 4,172 1.02 0.88, 1.19 0.8 1.02 0.88, 1.19 0.8
cough 4,172 1.01 0.85, 1.20 >0.9 1.01 0.85, 1.20 >0.9
aches 4,172 0.89 0.72, 1.08 0.2 0.88 0.72, 1.08 0.2
vomit 4,172 0.99 0.87, 1.12 0.8 0.98 0.87, 1.11 0.8
age_cat 4,172
0-4
5-9 1.04 0.86, 1.27 0.7 1.05 0.86, 1.27 0.7
10-14 1.03 0.84, 1.27 0.8 1.04 0.84, 1.27 0.7
15-19 1.09 0.88, 1.35 0.4 1.10 0.89, 1.36 0.4
20-29 1.02 0.84, 1.25 0.8 1.02 0.84, 1.25 0.8
30-49 1.03 0.81, 1.30 0.8 1.03 0.81, 1.31 0.8
50-69 1.47 0.73, 3.12 0.3 1.48 0.73, 3.14 0.3

1 OR = Odds Ratio, CI = Confidence Interval

base

mv_tab_base <- final_mv_reg %>% 
  ## get a tidy dataframe of estimates 
  broom::tidy(exponentiate = TRUE, conf.int = TRUE)

## combine univariate and multivariable tables 
left_join(univ_tab_base, mv_tab_base, by = "term") %>% 
  ## choose columns and rename them
  select(
    "characteristic" = term, 
    "recovered"      = "0", 
    "dead"           = "1", 
    "univ_or"        = estimate.x, 
    "univ_ci_low"    = conf.low.x, 
    "univ_ci_high"   = conf.high.x,
    "univ_pval"      = p.value.x, 
    "mv_or"          = estimate.y, 
    "mv_ci_low"      = conf.low.y, 
    "mv_ci_high"     = conf.high.y,
    "mv_pval"        = p.value.y 
  )
## # A tibble: 19 x 11
##    characteristic recovered  dead univ_or univ_ci_low univ_ci_high univ_pval mv_or mv_ci_low mv_ci_high mv_pval
##    <chr>              <int> <int>   <dbl>       <dbl>        <dbl>     <dbl> <dbl>      <dbl>      <dbl>   <dbl>
##  1 (Intercept)          913  1180   1.29        1.19          1.41  5.88e- 9 1.22       0.946       1.58   0.125
##  2 gender               910  1169   0.994       0.879         1.12  9.22e- 1 0.989      0.872       1.12   0.864
##  3 (Intercept)          377   474   1.26        1.10          1.44  9.07e- 4 1.22       0.946       1.58   0.125
##  4 fever               1446  1875   1.03        0.886         1.20  6.90e- 1 1.03       0.888       1.20   0.667
##  5 (Intercept)         1469  1885   1.28        1.20          1.37  7.80e-13 1.22       0.946       1.58   0.125
##  6 chills               354   464   1.02        0.876         1.19  7.87e- 1 1.02       0.877       1.19   0.767
##  7 (Intercept)          263   337   1.28        1.09          1.51  2.58e- 3 1.22       0.946       1.58   0.125
##  8 cough               1560  2012   1.01        0.845         1.20  9.42e- 1 1.01       0.848       1.20   0.910
##  9 (Intercept)         1629  2125   1.30        1.22          1.39  6.93e-16 1.22       0.946       1.58   0.125
## 10 aches                194   224   0.885       0.723         1.08  2.38e- 1 0.885      0.722       1.08   0.238
## 11 (Intercept)          927  1202   1.30        1.19          1.41  2.79e- 9 1.22       0.946       1.58   0.125
## 12 vomit                896  1147   0.987       0.874         1.12  8.37e- 1 0.985      0.871       1.11   0.806
## 13 (Intercept)          354   440   1.24        1.08          1.43  2.32e- 3 1.22       0.946       1.58   0.125
## 14 age_cat5-9           357   462   1.04        0.855         1.27  6.88e- 1 1.05       0.859       1.27   0.656
## 15 age_cat10-14         306   393   1.03        0.842         1.27  7.54e- 1 1.04       0.844       1.27   0.730
## 16 age_cat15-19         259   352   1.09        0.884         1.35  4.11e- 1 1.10       0.886       1.36   0.398
## 17 age_cat20-29         346   439   1.02        0.837         1.25  8.39e- 1 1.02       0.838       1.25   0.818
## 18 age_cat30-49         189   241   1.03        0.810         1.30  8.32e- 1 1.03       0.809       1.31   0.819
## 19 age_cat50-69          12    22   1.47        0.732         3.12  2.88e- 1 1.48       0.733       3.14   0.285

Forest plot

This section shows how to produce a plot with the outputs of your regression. There are two options: you can build a plot yourself using ggplot2, or use the easystats packages.

ggplot2 package

## remove the intercept term from your multivariable results
mv_tab_base %>% 
  filter(term != "(Intercept)") %>% 
  ## plot with variable on the y axis and estimate (OR) on the x axis
  ggplot(aes(x = estimate, y = term)) +
  ## show the estimate as a point
  geom_point() + 
  ## add in an error bar for the confidence intervals
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) + 
  ## show where OR = 1 is for reference as a dashed line
  geom_vline(xintercept = 1, linetype = "dashed")

easystats packages

The alternative, if you do not want to decide all of the different elements required for a ggplot, is to use a combination of easystats packages. In this case the parameters package function model_parameters() does the equivalent of the broom package function tidy(). The see package then accepts those outputs and creates a default forest plot as a ggplot object.

## remove the intercept term from your multivariable results
final_mv_reg %>% 
  model_parameters(exponentiate = TRUE) %>% 
  plot()

Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary

sthda stepwise regression

Moving averages

Overview

This page will cover methods to calculate and visualize moving averages.

To see a moving average for an epicurve, see the page on epicurves (LINK)

Preparation

Load packages

pacman::p_load(
  tidyverse,      # for data management and viz
  slider,         # for calculating moving averages
  tidyquant       # for calculating moving averages on-the-fly in ggplot
)

Calculate-then-display

In this approach, the slider package is used to calculate the moving average in the dataset prior to any plotting:

  • Within mutate(), a new column is created to hold the average. slide_index() from slider package is used as shown below.
  • In the ggplot(), a geom_line() is added after the histogram, reflecting the moving average.

See the helpful online vignette for the slider package

  • Can assign .before = Inf to achieve cumulative averages from the first row
  • Use slide() in simple cases
  • Use slide_index() to designate a date column as an index, so that dates which do not appear in the dataframe are still included in the window
    • .before, .after TODO
    • .complete TODO
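To make the indexing idea concrete, the window logic can be mimicked in base R with made-up dates (slider does this for you, efficiently, via slide_index()):

```r
## made-up daily counts with a gap (days 3-4 are missing)
dates <- as.Date("2021-01-01") + c(0, 1, 2, 5, 6)
cases <- c(2, 4, 6, 8, 10)

## for each row, average the cases whose date falls within
## the day itself and the 2 days before (a 3-day window)
avg_3day <- sapply(seq_along(dates), function(i) {
  in_window <- dates >= dates[i] - 2 & dates <= dates[i]
  mean(cases[in_window])
})

avg_3day   # the gap means the later days do not "see" days 0-2
```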

First we count the number of cases reported each day. Note that count() is appropriate if the data are in a linelist format (one row per case) - if starting with aggregated counts you will need to follow a different approach (e.g. summarize() - see page on Summarizing data).
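If the data arrive already aggregated, the alternative is to sum the existing count column per day instead of counting rows. A sketch with hypothetical column names, using base aggregate() (dplyr's group_by() + summarize() achieves the same):

```r
## hypothetical pre-aggregated data: counts reported by several hospitals per day
agg <- data.frame(
  date_onset = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  n_cases    = c(3, 2, 7)
)

## sum the existing counts per day (instead of counting rows)
daily <- aggregate(n_cases ~ date_onset, data = agg, FUN = sum)
daily
```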

# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>% 
  count(date_onset, name = "new_cases") %>%    # count cases by date, new column is named "new_cases"
  filter(!is.na(date_onset))

The new dataset now looks like this:

DT::datatable(ll_counts_7day, rownames = FALSE, options = list(pageLength = 6, scrollX=T) )

Next, we create a new column that is the 7-day average. We use slide_index() from slider specifically because there are missing days in the above dataframe, and they must be accounted for. To do this, we set our “index” (the .i argument) as the column date_onset. Since date_onset is a column of class Date, the function recognizes this and, when calculating the window, includes the days that do not appear in the dataframe. If you were to use another slider function like slide(), this indexing would not occur.

Also note that the 7-day window, in this example, is achieved with the argument .before = 6: the window is the day itself and the 6 days preceding it. If you want the window to be different (centered or following), use .after in conjunction.

## calculate the average number of cases in the preceding 7 days
ll_counts_7day <- ll_counts_7day %>% 
  mutate(
    avg_7day = slider::slide_index_dbl(    # create new column
        new_cases,                       # calculate avg based on value in new_cases column
        .i = date_onset,                 # index column is date_onset, so non-present dates are included in 7day window 
        .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
        .before = 6,                     # window is the day and 6-days before
        .complete = TRUE))               # fills in first days with NA

Step 2 is plotting the 7-day average, in this case shown on top of the underlying daily data.

ggplot(data = ll_counts_7day, aes(x = date_onset)) +
    geom_col(aes(y = new_cases), fill = "#92a8d1", colour = "#92a8d1")+ 
    geom_line(aes(y = avg_7day), color="red", size = 1) + 
    scale_x_date(
      date_breaks = "1 month",
      date_labels = '%d/%m',
      expand = c(0,0)) +
    scale_y_continuous(expand = c(0,0), limits = c(0, NA)) + 
    labs(x="", y ="Number of confirmed cases")+ 
    theme_minimal() 
## Warning: Removed 1 row(s) containing missing values (geom_path).

Calculate on-the-fly

TBD - tidyquant

## schematic example: tests_per_county and theme_text_size are placeholder objects
per_pos_plot_county <- ggplot(data = tests_per_county,
       aes(x = DtSpecimenCollect_Final, y = prop_pos))+
  geom_line(size = 1, alpha = 0.2)+    # plot raw values
  tidyquant::geom_ma(n = 7, size = 2)+ # plot 7-day moving average
  cowplot::theme_minimal_hgrid()+
  coord_cartesian(xlim = c(as.Date("2020-03-15"), Sys.Date()), ylim = c(0, 15))+
  labs(title    = "COUNTY-WIDE TESTING PERCENT POSITIVE",
       subtitle = "Daily and 7-day moving average",
       y        = "Percent Positive",
       x        = "Date Specimen Collected")+
  theme_text_size+   # a pre-defined theme object (placeholder)
  theme(axis.text = element_text(face = "bold", size = 14),
        panel.background = element_rect(fill = "khaki")
        )

Resources

See the helpful online vignette for the slider package

If your use case requires that you “skip over” weekends and even holidays, you might like the almanac package.

Outbreak detection

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Endemic corridor analysis Detecting spikes in syndromic/routine surveillance

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Time series analysis

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Epidemic modeling

Overview

R(t) estimations Doubling times Projections

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Modeling

Overview

UNDER CONSTRUCTION

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Overview

Tidymodels

Logistic Regression

Multi-level modeling Regression

Survival analysis

Multi-stage Markov models

Liza Coyer TODO this? longitudinal data

Tables of model results

Causal diagrams

Survey analysis

UNDER CONSTRUCTION

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

From data frame

Overview

Weighting

Random selection

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Survival analysis

UNDER CONSTRUCTION

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

GIS basics

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Plotting coordinates

polygons and shapefiles

Simple analyses

Distance to nearest X (HCF)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

V Data Visualization

ggplot tips

UNDER CONSTRUCTION

https://www.tidyverse.org/blog/2018/07/ggplot2-3-0-0/

Overview

Embed ggplot cheatsheet

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • Melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Highlighting

highlighting one line among many etc gghighlight

Dual axes

Cowplot Complicated method (% 100 * …)

Smart Labeling

ggrepel

Time axes

Dual axes

Adding shapes

Animations

Epidemic curves

Overview

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               lubridate,    # working with  dates    
               aweek,        # alternative package for working with dates
               incidence,    # an option for epicurves of linelist data
               stringr,      # Search and manipulate character strings
               forcats,      # working with factors
               RColorBrewer) # Color palettes from colorbrewer2.org

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated epidemic
  • Aggregated counts by hospital from the same simulated epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.

For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
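If you ever need to derive such aggregated counts from a linelist yourself, below is a minimal sketch. It assumes the linelist columns hospital and date_onset for illustration - the grouping columns in your own data may differ.

```r
# count cases per hospital and day of onset (grouping columns assumed for illustration)
count_data <- linelist %>% 
  dplyr::count(hospital, date_onset, name = "n_cases")  # one row per hospital-day, with case count
```

dplyr::count() is shorthand for group_by() followed by summarise(n = n()).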

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Review the two datasets and notice the differences

Case linelist

The first 50 rows are displayed

Case counts aggregated by hospital

The first 50 rows are displayed

Set parameters

You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). You can then reference the data_date in the code when applying filters or in captions that auto-update.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")
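For example, data_date could be referenced like this (a sketch - the filtered column and caption wording are illustrative):

```r
# keep only cases with onset on or before the data date (sketch)
linelist_current <- linelist %>% 
  dplyr::filter(date_onset <= data_date)

# an auto-updating caption string for plots or reports
caption_text <- stringr::str_glue("Data as of {format(data_date, '%d %b %Y')}")
```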

Verify dates

Verify that each relevant date column is class Date and has an appropriate range of values. This for loop prints a histogram for each column.

# create character vector of column names 
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))

# Produce histogram of each date column
for (Col in DateCols) {     # open loop. iterate for each name in vector DateCols
  hist(linelist[, Col],     # print histogram of the column in linelist dataframe
       breaks = 50,         # number of breaks for the histogram
       xlab = Col)          # x-axis label is the name of the column
  }                         # close the loop

incidence package

Below are tabs on making quick epicurves using the incidence package

CAUTION: The incidence package expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.

TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.

https://cran.r-project.org/web/packages/incidence/vignettes/customize_plot.html#example-data-simulated-ebola-outbreak

Intro

Two steps are required to plot an epicurve with the incidence package:

  1. Create an incidence object (using the function incidence())
    • Provide the case linelist
    • Specify the time interval into which the cases should be aggregated (daily, weekly, monthly..)
    • Specify any sub-groups
  2. Plot the incidence object
    • Specify labels, aesthetic themes, etc.

A simple example - an epicurve of daily cases:

# load incidence package
library(incidence)

# create the incidence object using data by day
epi_day   <- incidence(linelist$date_onset,  # dates of onset from the linelist
                       interval = "day")     # the time interval
## 248 missing observations were removed.
# plot the incidence object
plot(epi_day)

Change time interval of case aggregation (bars)

The interval argument defines how the observations are grouped. Available options include all the options from the package aweek, including but not limited to:

  • “week” (Monday start day is default)
  • “2 weeks” (or 3, 4, 5…)
  • “Sunday week”
  • “2 Sunday weeks” (or 3, 4, 5…)
  • “MMWRweek” (starts on Sunday - see US CDC)
  • “month” (1st of month)
  • “quarter” (1st of month of quarter)
  • “2 months” (or 3, 4, 5…)
  • “year” (1st day of calendar year)

Below are examples of how different intervals look when applied to the linelist.
Format and frequency of the date labels on the x-axis are the defaults for the specified interval.

# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk      <- incidence(linelist$date_onset, interval = "Monday week")
## 248 missing observations were removed.
# Sunday week
epi_Sun_wk  <- incidence(linelist$date_onset, interval = "Sunday week")
## 248 missing observations were removed.
# Three weeks (Monday weeks by default)
epi_3wk     <- incidence(linelist$date_onset, interval = "3 weeks")
## 248 missing observations were removed.
# Monthly
epi_month   <- incidence(linelist$date_onset, interval = "month")
## 248 missing observations were removed.
# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+     labs(title = "Monday weeks")
plot(epi_Sun_wk)+ labs(title = "Sunday weeks")
plot(epi_3wk)+    labs(title = "Every 3 Monday weeks")
plot(epi_month)+  labs(title = "Months")

Modifications

The incidence package enables modifications in the following ways:

  • Arguments of plot() (e.g. show_cases, col_pal, alpha…)
  • scale_x_incidence() and make_labels()
  • ggplot() additions via the + operator

Read details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.

plot() modifications

An incidence plot can be modified in the following ways. Type ?plot.incidence in the R console for more details.

  • show_cases = If TRUE, each case is shown as a box. Best on smaller outbreaks.
  • color = Color of case bars/boxes
  • border = Color of line around boxes, if show_cases = TRUE
  • alpha = Transparency of case bars/boxes (1 is fully opaque, 0 is fully transparent)
  • xlab = Title of x-axis (axis labels can also be applied using labs() from ggplot)
  • ylab = Title of y-axis; defaults to user-defined incidence time interval
  • labels_week = Logical, indicating whether x-axis labels are in week or date format, absent other modifications
  • n_breaks = Number of x-axis label breaks, absent other modifications
  • first_date, last_date = Dates used to trim the plot

See examples of these arguments in the subsequent tabs.

Filtered data

To plot the epicurve of a subset of data:

  1. Filter the linelist data
  2. Feed the subset to the incidence() command

The example below uses data filtered to show only cases at Central Hospital.

# filter the dataset
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using subset of data
central_outbreak <- incidence(central_data$date_onset, interval = "week")
## 15 missing observations were removed.
# plot
plot(central_outbreak) + labs(title = "Weekly case incidence at Central Hospital")

Date-axis labels/gridlines

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars

Modify the bars
The aggregation of data into bars occurs via the interval = argument when creating the incidence object. The options for interval come from the package aweek and include options like “day”, “Monday week”, “Sunday week”, “month”, “2 weeks”, etc. See the incidence intro tab for more information.

Modify date-axis labels (frequency & format)

If working with the incidence package, you have several options to make these modifications. Some utilize the incidence package functions scale_x_incidence() and make_breaks(), others use the ggplot2 function scale_x_date(), and others use a combination.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut off your data if the data change!

Option 1: scale_x_incidence() only
  1. Add scale_x_incidence() from the incidence package:
    • Why use this approach?
      • Advantages: Short code. Auto-adjusts weekly labels to interval of incidence object (Monday, Sunday weeks, etc.)
      • Disadvantages: Cannot make fine adjustments to label format or minor vertical gridlines between labels
    • Provide the name of the incidence object to ensure labels align with specified interval (e.g. Sundays or Mondays)
    • optional: n_breaks = specifies the number of date labels, which start from the interval of the first case.
      • for breaks every nth week, use n_breaks = nrow(i)/n (“i” is the incidence object name and “n” is a number)
    • optional: labels_week = whether labels are formatted as weeks (YYYY-Www) or as dates (YYYY-MM-DD)
    • One vertical gridline will appear per date label

Other notes:

  • Type ?scale_x_incidence into the R console to see more information.
  • If incidence interval is “month”, n_breaks and labels_week will behave differently
  • Adding scale_x_date() to the plot will remove labels created by scale_x_incidence()
  • Note in plot below that the first label is 27 April 2014, the Sunday before the first case (May 1), aligning with Sunday weeks of the incidence object.
# create weekly incidence object (Sunday weeks)
i <- incidence(central_data$date_onset, interval = "Sunday week")
## 15 missing observations were removed.
plot(i)+
  scale_x_incidence(i,                    # name of incidence object
                    labels_week = F,      # show dates instead of weeks
                    n_breaks = nrow(i)/8) # breaks every 8 weeks from week of first case
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Option 2: scale_x_date() and make_breaks()
  1. Add scale_x_date() from ggplot2, but also leverage make_breaks() from incidence:
    • Why use this approach?
      • Advantages: Best of both worlds: weekly labels auto-aligned to incidence interval, and you can make detailed adjustments to label format
      • Disadvantages: If minor gridlines between Sunday-week date labels are desired, they are not auto-aligned
    • After creating the incidence object, use make_breaks() to define date label breaks
      • make_breaks() is similar to scale_x_incidence() (described above). Provide the incidence object name and optionally n_breaks as described before.
    • Add scale_x_date() to the plot:
      • breaks = provide the breaks vector you created with make_breaks(), followed by $breaks (see example below)
      • date_labels = provide a format for the date labels (e.g. “%d %b”) (use “\n” for a new line)
# Break modification using scale_x_date() and make_breaks()
###########################################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
## 15 missing observations were removed.
# make breaks
i_labels <-  make_breaks(i, n_breaks = nrow(i)/6) # using interval from i, breaks every 6 weeks

# plot
plot(i)+
  scale_x_date(breaks      = i_labels$breaks, # call the breaks
               date_labels = "%d\n%b '%y",    # date format
               date_minor_breaks = "weeks")   # minor vertical gridlines each week  
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Option 3: Use scale_x_date() only
  1. Use scale_x_date() only
    • Advantages: Complete control over breaks, labels, gridlines, and plot width
    • Disadvantages: More code required, more opportunity to make mistakes.
    • If your incidence intervals are days or Monday weeks, (easy!):
      • Provide interval to date_breaks = (e.g. “day”, “week”, “2 weeks”, “month”, “year”)
      • Provide interval to date_minor_breaks = for vertical lines between date labels
    • If your incidence intervals are Sunday weeks, it is more complex - see the tab for a Sunday week example
      • Provide a sequence of Sunday dates to breaks = and to minor_breaks =
    • Use date_labels = for formatting (see Dates page for tips)
    • Add the argument expand = c(0,0) to start labels at the first incidence bar. Otherwise, first label will shift depending on your specified label interval.

Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use of scale_x_discrete() instead of scale_x_date() - see the ggplot tips page for more details.

# Break modification using scale_x_date() only
##############################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
## 15 missing observations were removed.
# plot
plot(i)+
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")  # date labels format 
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

A Sunday week example

If you want a plot of Sunday weeks and also finely-adjusted label formats, you might find a code example helpful.
Here is an example of producing a weekly epicurve using incidence for Sunday weeks, with finely-adjusted date labels through scale_x_date():

# load packages
pacman::p_load(tidyverse,  # for ggplot
               incidence,  # for epicurve
               lubridate)  # for floor_date() and ceiling_date()

# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset, interval = "Sunday week") # equivalent to "MMWRweek" (see US CDC)
## 15 missing observations were removed.
# plot() the incidence object
plot(central_outbreak)+                  
  
  ### ggplot() commands added to the plot
  # scale modifications 
  scale_x_date(
    expand = c(0,0),                 # remove excess x-axis space below and after case bars
    
    # sequence by 3 weeks, from Sunday before first case to Sunday after last case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    
    # sequence by week, from Sunday before first case to Sunday after last case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    # date labels
    date_labels = "%d\n%b'%y")+       # adjust how dates are displayed
  
  scale_y_continuous(
    expand = c(0,0),                  # remove excess space under x-axis
    breaks = seq(0, 30, 5))+          # adjust y-axis intervals
  
  # Aesthetic themes
  theme_minimal()+                    # simplify background
  theme(
    axis.title = element_text(size = 12, face = "bold"),       # axis titles formatting
    plot.caption = element_text(face = "italic", hjust = 0))+  # caption formatting, left-aligned
  
  # Plot labels
  labs(x = "Week of symptom onset (Sunday weeks)", 
       y = "Weekly case incidence", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Show individual cases

To show boxes around each individual case, use the argument show_cases = TRUE in the plot() function.

Boxes around each case can be more reader-friendly if the outbreak is small. Boxes can be applied when the interval is days, weeks, or any other time period. The code below creates the weekly epicurve for a smaller outbreak (only cases from Central Hospital), with boxes around each case.

# create filtered dataset for Central Hospital
central_data  <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")
## 15 missing observations were removed.
# plot outbreak
plot(central_outbreak,
     show_cases = T)                 # show boxes around individual cases

The same epicurve showing individual cases, but with other aesthetic modifications:

# add plot() arguments and ggplot() commands
plot(central_outbreak,
     show_cases = T,                 # show boxes around each individual case
     color = "lightblue",            # color inside boxes
     border = "darkblue",            # color of border around boxes
     alpha = 0.5)+                    # transparency
  
  ### ggplot() commands added to the plot
  # scale modifications
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space below and after case bars
    date_breaks       = "4 weeks",      # labels appear every 4 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b'%y")+   # date labels format 
  
  scale_y_continuous(
    expand = c(0,0),              # remove excess space under x-axis
    breaks = seq(0, 35, 5))+      # adjust y-axis intervals
  
  # aesthetic themes
  theme_minimal()+                                                 # simplify background
  
  theme(
    axis.title = element_text(size = 12, face = "bold"),       # axis title format
    plot.caption = element_text(face = "italic", hjust = 0))+  # caption format and left-align
  
  # plot labels
  labs(x = "Week of symptom onset (Monday weeks)", 
       y = "Weekly reported cases", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Color by value

To color the cases by a value, provide the column to the groups = argument of the incidence() command. In the example below the cases are colored by their age category. Note the incidence() argument na_as_group =: if TRUE (the default), missing values (NA) will form their own group.

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group
## 248 missing observations were removed.
# plot the epicurve
plot(age_outbreak) 

Adjusting order

To adjust the order of group appearance (on the plot and in the legend), the group column must be class Factor. Set the order by specifying the order of the factor levels (including NA). Below is an example with gender groups, using data from Central Hospital only.

  • First, the dataset is defined and gender is re-defined as a factor
  • The order of the levels of gender is defined with NA first, so it appears at the top of the bars
  • More appropriate labels are defined for each factor level - these appear in the legend
  • The argument exclude = NULL in factor() is necessary to include NA as a level, as it is excluded by default
  • The title of the legend is adjusted using fill = in labs()

You can read more about factors in their page (LINK)

# Create incidence object, data grouped by gender
#################################################

# Classify "gender" column as factor
####################################
# with specific level order and labels, including for missing values
central_data <- linelist %>% 
  filter(hospital == "Central Hospital") %>% 
  mutate(gender = factor(gender,
                         levels = c(NA, "f", "m"),
                         labels = c("Missing", "Female", "Male"),
                         exclude = NULL))

# Create incidence object, by gender
####################################
gender_outbreak_central <- incidence(central_data$date_onset, 
                                     interval = "week", 
                                     groups = central_data$gender,
                                     na_as_group = TRUE)   # Missing values assigned their own group
## 15 missing observations were removed.
# plot epicurve with modifications
##################################
plot(gender_outbreak_central,
     show_cases = TRUE)+                            # show box around each case
     
     ### ggplot commands added to plot
     # scale modifications
     scale_x_date(expand = c(0,0),
                  date_breaks = "6 weeks",
                  date_minor_breaks = "week",
                  date_labels = "%d %b\n%Y")+
  
     # aesthetic themes
     theme_minimal()+                               # simplify plot background
     theme(
       legend.title = element_text(size = 14, face = "bold"),
       axis.title = element_text(face = "bold"))+   # axis title bold
     
      # plot labels
      labs(fill = "Gender",                         # title of legend
           title = "Show case boxes, with modifications",
           y = "Weekly case incidence",
           x = "Week of symptom onset")      
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Change colors and legend

To change the legend
Use ggplot() commands such as:

  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • theme(legend.title = element_blank()) to have no title

See the page of ggplot() tips for more details on legends.
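As a sketch, these commands could be applied to the age-grouped incidence object created in the "Color by value" tab:

```r
# sketch: legend below the plot, horizontal, with no title
plot(age_outbreak) +                        # any grouped incidence plot
  theme(legend.position  = "bottom",        # place legend under the plot
        legend.direction = "horizontal",    # lay legend entries side-by-side
        legend.title     = element_blank()) # remove the legend title
```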

To specify colors manually, provide the name of a color, or a character vector of multiple colors, to the argument color =. Note that to function properly, the number of colors listed must equal the number of groups (be aware of missing values as a group).

# weekly outbreak by hospital
hosp_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$hospital,
                               na_as_group = FALSE)   # Missing values not assigned their own group
## 248 missing observations were removed.
# default colors
plot(hosp_outbreak)

# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))

To change the color palette
Use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).

Other palettes include TO DO add page with palette names… To DO

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group
## 248 missing observations were removed.
# plot the epicurve
plot(age_outbreak)

# plot with different color palette
plot(age_outbreak, col_pal = rainbow)

Facets/small multiples

To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()
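As a brief preview, a minimal ggplot2 faceting sketch using this page's linelist columns (date_onset, hospital) could look like:

```r
# sketch: weekly epicurve "small multiples", one panel per hospital
ggplot(linelist, aes(x = date_onset)) +
  geom_histogram(binwidth = 7) +   # 7-day bins (see CAUTION about bin alignment)
  facet_wrap(~ hospital) +         # one small panel per value of hospital
  labs(title = "Weekly epicurves by hospital")
```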

ggplot()

Below are tabs on using the ggplot2 package to produce epicurves from a linelist dataset.

Unlike with the incidence package, you must manually control the aggregation of the data (into weeks, months, etc.) and the labels on the date axis. If not carefully managed, this can lead to many headaches.

These tabs use a subset of the linelist dataset - only the cases from Central Hospital.

central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

Intro

To produce an epicurve with ggplot() there are three main elements:

  • A histogram, to aggregate the linelisted cases into “bins” and display bars of the counts per bin (potentially by grouped values)
  • Scales for the axes and their associated labels (see tab on modifications)
  • Aesthetic themes for the plot, including titles, labels, captions, etc.

Below is perhaps the simplest code to produce daily and weekly epicurves. Axis scales and labels use default options.

# daily 
ggplot(data = central_data, aes(x = date_onset)) +  # x column must be class Date
  geom_histogram(binwidth = 1)+                     # date values binned by 1 day 
  labs(title = "Daily")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# weekly
ggplot(data = central_data, aes(x = date_onset)) +  
  geom_histogram(binwidth = 7)+                     # date values binned each 7 days (arbitrary 7 days!) 
  labs(title = "Weekly")
## Warning: Removed 15 rows containing non-finite values (stat_bin).

CAUTION: Using binwidth = 7 starts the first bin at the first case, which could be any day of the week! To create specific Monday or Sunday weeks, see below .

To create weekly epicurves where the bins begin on a specific day of the week (e.g. Monday, Sunday), specify the histogram breaks = manually (not binwidth =). This can be done by creating a sequence of dates using seq.Date() from base R. You can start/end the sequence at a specific date (e.g. as.Date("YYYY-MM-DD")), or write flexible code to begin the sequence on a specific day of the week before the first case. An example of creating such weekly breaks is below:

seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
         to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
         by   = "7 days")

To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.

This code is applied to create the histogram breaks, and also the breaks for the date labels. Read more about the date labels in the Modifications tab. Defining your breaks like above will be necessary if your weekly bins are not by Monday weeks.

Below is detailed code to produce weekly epicurves for Monday and Sunday weeks. See the tab on Modifications (axes) to learn the nuances of date-axis label management.

Monday weeks

Of note:

  • The break points of the histogram bins are specified manually to begin the Monday (week_start = 1) before the earliest case and to end the Monday after the last case (see explanation above).
  • The breaks for date labels on x-axis - because the bins are Monday weeks this code uses date_breaks = within scale_x_date(), which also uses Monday weeks. Sunday weeks use a different method.
  • Minor vertical gridlines between date labels are made using date_minor_breaks = within scale_x_date(), again because this plot is for Monday weeks. Sunday weeks use a different method.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
  • Color and fill are defined in geom_histogram()
# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # x-axis labels
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+             # remove excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Monday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Mondays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 15 rows containing non-finite values (stat_bin).

Sunday weeks

The below code creates a histogram of the rows, using a date column as the x-axis. Of note:

  • The break points of the histogram bins are specified manually to begin the Sunday (week_start = 7) before the earliest case and to end the Sunday after the last case (see explanation above).
  • The breaks for date labels on the x-axis and vertical gridlines - because the bins are not Monday weeks, manually specified vectors of dates are given to breaks = and minor_breaks = within scale_x_date(). You cannot use the scale_x_date() arguments of date_breaks and date_minor_breaks as these align with Monday weeks.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
  • Color and fill are defined in geom_histogram()
# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  
  # For histogram, manually specify bin break points: starts the Sunday before first case, end Sunday after last case
  geom_histogram(                    
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # The labels on the x-axis
  scale_x_date(expand = c(0,0),
               breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                 to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                 by   = "3 weeks"),
               minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                       to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                       by   = "7 days"),
               date_labels = "%d\n%b\n'%y")+             # day, above month abbrev., above 2-digit year
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Sunday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Sundays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 15 rows containing non-finite values (stat_bin).

Modifications

Modify axes

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars.

To modify the aggregation of data into bins/bars, do one of the following:

  • Specify a binwidth = within geom_histogram() - for a column of class Date, the given number is interpreted in days
  • Specify breaks = as a sequence of bin break-point dates
  • Group the rows into aggregated counts (by week, month, etc.) and feed the aggregated counts to ggplot(). See the tab on aggregated counts for more information.
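As a minimal sketch of the third option (a toy data frame stands in here for the handbook's linelist), rows can be aggregated into Monday-week counts and the pre-aggregated counts passed to geom_col():

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# toy onset dates standing in for the real linelist
toy <- data.frame(date_onset = as.Date("2014-05-01") + c(0, 1, 3, 8, 9, 15))

# aggregate rows into Monday-week counts
weekly <- toy %>%
  mutate(week = floor_date(date_onset, "week", week_start = 1)) %>%
  count(week, name = "cases")

# feed the pre-aggregated counts to ggplot - note geom_col(), not geom_histogram()
ggplot(weekly, aes(x = week, y = cases)) +
  geom_col()
```

Because the counts are pre-computed, geom_col() plots the values as given rather than binning rows itself.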

To modify the date labels, use scale_x_date() in one of these ways:

  • If your histogram bins are days, Monday weeks, months, or years:
    • Use date_breaks = to specify label frequency (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)
    • Use date_minor_breaks = to specify frequency of minor vertical gridlines between date labels
    • Add expand = c(0,0) to begin the labels at the first bar (otherwise, first label will shift forward depending on specified frequency)
    • Use date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)
  • If your histogram bins are Sunday weeks:
    • Use breaks = and minor_breaks = by providing a sequence of dates for breaks
    • You can still use date_labels = for formatting as described above

To create a sequence of dates
You can use seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD")), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such flexible breaks is below:

seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
         to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
         by   = "7 days")

To produce the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which, per the arguments specified above, returns the start date of that “week”, given that each week starts on a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is produced by the complementary ceiling_date() function, which returns the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.
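For example, applied to hypothetical first and last case dates, the flexible breaks resolve to the surrounding Mondays:

```r
library(lubridate)

first_case <- as.Date("2014-05-01")   # a Thursday (hypothetical)
last_case  <- as.Date("2014-07-16")   # a Wednesday (hypothetical)

start <- floor_date(first_case, "week", week_start = 1)    # Monday before: 2014-04-28
end   <- ceiling_date(last_case, "week", week_start = 1)   # Monday after:  2014-07-21

# a sequence of Mondays spanning the outbreak, 7 days apart
weekly_breaks <- seq.Date(from = start, to = end, by = "7 days")
```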

If using aggregated counts (for example an epiweek x-axis), your x-axis may not be of class Date and may require scale_x_discrete() instead of scale_x_date() - see the ggplot tips page for more details.

Set maximum and minimum date values using limits = c() within scale_x_date(). E.g. scale_x_date(limits = c(as.Date("2014-04-01"), NA)) sets a minimum but leaves the maximum open.

CAUTION: Use limits with caution! They remove all data outside the limits, which can impact the y-axis max/min, modeling, and other statistics. Strongly consider instead adding coord_cartesian() to your plot and setting its xlim = / ylim =, which acts as a “zoom” without removing data.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut off your data if the data changes!
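One safer alternative is to derive the break sequence from the data itself. This is only a sketch (weekly_counts is a hypothetical counts data frame):

```r
# derive y-axis breaks from the data, instead of hard-coding seq(0, 30, 5)
weekly_counts <- data.frame(n = c(3, 12, 27, 18))            # hypothetical counts

y_top    <- ceiling(max(weekly_counts$n, na.rm = TRUE) / 5) * 5  # round max up to nearest 5
y_breaks <- seq(0, y_top, by = 5)                                # breaks grow with the data
# then, in the plot: scale_y_continuous(breaks = y_breaks)
```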

See all % date format shortcuts at https://rdrr.io/r/base/strptime.html

Below is a demonstration of some plots where the bins and the plot labels/gridlines are aligned and not aligned:
Click “Code” to see the code

# 7-day binwidth defaults
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
  geom_histogram(
    binwidth = 7,                       # 7 days per bin (! starts at first case!)
    color = "darkblue",                 # color of lines around bars
    fill = "lightblue") +               # color of bar fill
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\ndefault axis labels/ticks not aligned")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# 7-day bins + Monday labels
#############################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,                 # 7-day bins with start at first case
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),               # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",       # Monday every 3 weeks
    date_minor_breaks = "week",    # Monday weeks
    date_labels = "%d\n%b\n'%y")+  # label format
  
  scale_y_continuous(
    expand = c(0,0))+              # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nDate labels and gridlines on Mondays")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# 7-day bins + Months
#####################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),                 # remove excess x-axis space below and after case bars
    date_breaks = "months",          # 1st of month
    date_minor_breaks = "week",      # Monday weeks
    date_labels = "%d\n%b\n'%y")+    # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nGridlines at 1st of each month (with labels) and weekly on Mondays\nLabels on 1st of each month")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  labs(
    title = "ALIGNED Mondays",
    subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),
    # date label breaks set to every 3 weeks beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    # gridlines set to weekly beginning Sunday before first case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    date_labels = "%d\n%b\n'%y")+  # label format
  
  labs(title = "ALIGNED Sundays",
       subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")
## Warning: Removed 15 rows containing non-finite values (stat_bin).
# Check values of bars by creating dataframe of grouped values
# central_tab <- central_data %>% 
#   mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>% 
#   group_by(week, .drop=F) %>%
#   summarize(n = n()) %>% 
#   mutate(groups_3wk = 1:(nrow(central_tab)+1) %/% 3) %>% 
#   group_by(groups_3wk) %>% 
#   summarize(n = n())

Color by groups

Designate a column containing groups

In any of the code templates (Sunday weeks, Monday weeks), make the following changes:

  • Add the aesthetics argument aes() within the geom_histogram() (don’t forget comma afterward)
  • Within aes(), provide the grouping column name to group = and fill = (no quotes needed). group is necessary, while fill changes the color of the bar.
  • Remove any fill = argument outside of the aes(), as it will override the one inside
  • Arguments inside aes() will apply by group, whereas any outside will apply to all bars (e.g. you may want color = outside, so each bar has the same color perimeter/border)
geom_histogram(
    aes(group = gender, fill = gender))

Adjust colors:

  • To manually adjust the bar fill color of each group, use scale_fill_manual() (note scale_color_manual() is different!).
    • Use the values = argument to apply a vector of colors.
    • Use na.value = to specify a color for missing values.
    • ! While you can use the labels = argument in scale_fill_manual() to change the legend text labels, it is easy to accidentally give labels in the incorrect order and produce an incorrect legend! It is recommended to instead convert the group column to class Factor and designate factor labels and order, as explained below.
  • To adjust the colors via a color scale, see the page on ggplot tips
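A minimal sketch of scale_fill_manual() with a named values = vector, which ties each color to its level and sidesteps ordering mistakes (the levels “Missing”, “Female”, “Male” are assumed, matching the factor conversion described below):

```r
library(ggplot2)

# naming the vector ties each color to a specific factor level
fill_scale <- scale_fill_manual(
  values   = c("Missing" = "grey",
               "Female"  = "orange",
               "Male"    = "purple"),
  na.value = "grey")   # color for any values not captured as a level
```

This scale object can then be added to the plot with + as usual.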

Adjust the stacking order and Legend

The stacking order, and the labels for each group in the legend, are best adjusted by converting the group column to class Factor. You can then designate the levels, their labels, and their order (which is reflected in the stack order).

Step 1: Before making the ggplot, convert the group column to class Factor using factor() from base R.
For example, with a column “gender” with values “m” and “f” and NA, this can be put in a mutate() command as:

dataset <- dataset %>% 
  mutate(gender = factor(gender,
                    levels = c(NA, "f", "m"),
                    labels = c("Missing", "Female", "Male"),
                    exclude = NULL))

The above code establishes the levels so that missing values come “first” (and will appear on top of the stack). The labels that will display are then given in the same order. Lastly, the exclude = NULL statement ensures that NA is included as a level (by default, factor() drops NA).
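A quick self-contained check of this recoding (a toy vector stands in for the real gender column):

```r
# toy gender vector with a missing value
gender <- factor(c("m", "f", NA, "f"),
                 levels  = c(NA, "f", "m"),
                 labels  = c("Missing", "Female", "Male"),
                 exclude = NULL)   # keep NA as its own level

levels(gender)   # "Missing" "Female" "Male"
table(gender)    # the NA value is now counted under "Missing"
```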

Read more about factors in their dedicated handbook page (LINK).

Adjusting the legend

Read more about legends in the ggplot tips page. Here are a few highlights:

  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • theme(legend.title = element_blank()) to have no title

See the page of ggplot() tips for more details on legends.

These steps are shown in the example below:

Click “Code” to see the code

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(
  from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
  to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
  by   = "7 days") # bins are 7-days

# Set gender as factor and missing values as first level (to show on top)
central_data <- linelist %>%
  filter(hospital == "Central Hospital") %>% 
  mutate(gender = factor(
    gender,
    levels = c(NA, "f", "m"),
    labels = c("Missing", "Female", "Male"),
    exclude = NULL))  

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    aes(group = gender, fill = gender),    # arguments inside aes() apply by group
    color = "black",                       # arguments outside aes() apply to all data
    breaks = bin_breaks)+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space below and after case bars
    date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(
    values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases, by gender",
    subtitle = "Subtitle",
    fill     = "Gender",                                      # provide new title for legend
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 15 rows containing non-finite values (stat_bin).

Display bars side-by-side

Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge".
If there are more than two groups, dodged bars can become difficult to read. Consider instead using a faceted plot (“small multiples”) (see tab). To improve readability in this example, missing gender values are removed.

Click “Code” to see the code

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days") # bins are 7-days

# New dataset without rows missing gender
central_data_dodge <- linelist %>% 
  filter(hospital == "Central Hospital") %>% 
  filter(!is.na(gender)) %>%                            # remove rows missing gender
  mutate(gender = factor(gender,                        # factor now has only two levels (missing not included)
                         levels = c("f", "m"),
                         labels = c("Female", "Male")))  

# make plot
###########
ggplot(central_data_dodge, aes(x = date_onset)) + 
    geom_histogram(
        aes(group = gender, fill = gender),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks,
        position = "dodge")+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("pink", "lightblue"))+     # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 14 rows containing non-finite values (stat_bin).

Faceting/small-multiples

As with other ggplots, you can create faceted plots (“small multiples”) from the values in a column. As explained in the ggplot tips page of this handbook, you can use either:

  • facet_wrap()
  • facet_grid()

For epicurves, facet_wrap() is typically easiest as it is likely that you only need to facet on one column. The general syntax is facet_wrap(rows ~ cols), where to the left of the tilde (~) is the name of a column to be spread across the “rows” of the new plot, and to the right of the tilde is the name of a column to be spread across the “columns” of the new plot.

Most simply, just use one column name, to the right of the tilde: facet_wrap(~age_cat).

Free axes
You will need to decide whether the scales (scales =) of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). You can also specify “free_x” or “free_y” to release in only one dimension.

Number of cols and rows
This can be specified with ncol = and nrow = within facet_wrap().

Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.

Aesthetics
Font size and face, strip color, etc. can be modified through theme() with arguments like:

  • strip.text = element_text() (size, colour, face, angle…)
  • strip.background = element_rect() (e.g. element_rect(fill = "red"))

The position of the strip can be modified as the strip.position = argument within facet_wrap() (e.g. “bottom”, “top”, “left”, “right”)
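The options above can be collected into a single sketch (age_cat is the handbook's age-category column; the specific values here are illustrative only):

```r
library(ggplot2)

# one facet_wrap() call combining the options discussed above
facet_opts <- facet_wrap(
  ~ age_cat,                   # one panel per value of age_cat
  scales = "free_y",           # y-axis adapts to each panel's data
  ncol   = 4,                  # arrange panels in 4 columns
  strip.position = "bottom")   # facet titles below each panel
```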

Strip labels
Labels of the facet panels can be modified through the labels of the column as a factor, or by using a “labeller”.

Make a labeller like this, using the function as_labeller() from ggplot2:

my_labels <- as_labeller(c(
     "0-4"   = "Ages 0-4",
     "5-9"   = "Ages 5-9",
     "10-14" = "Ages 10-14",
     "15-19" = "Ages 15-19",
     "20-29" = "Ages 20-29",
     "30-49" = "Ages 30-49",
     "50-69" = "Ages 50-69",
     "70+"   = "Over age 70"))

An example plot
Faceted by column age_cat. Click “Code” to see the code.

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
  
  geom_histogram(
        aes(group = age_cat, fill = age_cat),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
    
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_rect(fill = "grey"))+               # axis titles in bold
  
  # create facets
  facet_wrap(~age_cat,
             ncol = 4,
             strip.position = "top",
             labeller = my_labels)+             
  
  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 15 rows containing non-finite values (stat_bin).

See this link for more information on labellers.

Add total epidemic to background
Add a separate geom_histogram() command before the current one, setting its data = to the dataset without the column used for faceting (see select()). Then specify a color like “grey” and a degree of transparency (alpha =) to make it appear in the background.

geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+

Note that the y-axis maximum is now based on the height of the entire epidemic. Click “Code” to see the code.

ggplot(central_data, aes(x = date_onset)) + 
  
  # for background shadow of whole outbreak
  geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+

  # actual epicurves by group
  geom_histogram(
        aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  # Labels on x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_rect(fill = "white"))+               # axis titles in bold
  
  # create facets
  facet_wrap(~age_cat,                          # each plot is one value of age_cat
             ncol = 4,                          # number of columns
             strip.position = "top",            # position of the facet title/strip
             labeller = my_labels)+             # labeller defines above
  
  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 105 rows containing non-finite values (stat_bin).
## Warning: Removed 15 rows containing non-finite values (stat_bin).

Create one facet with ALL data
To do this, duplicate all the data (doubling the number of rows in the dataset) and, in the faceted column, give the duplicated rows a new value (e.g. “all”). A helper function that enables this is below:

# Define helper function
CreateAllFacet <- function(df, col){
     df$facet <- df[[col]]
     temp <- df
     temp$facet <- "all"
     merged <-rbind(temp, df)
     
     # ensure the original grouping column is a factor
     merged[[col]] <- as.factor(merged[[col]])
     
     return(merged)
}

# Create dataset that is duplicated, to show "all zones" as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
  mutate(facet = factor(facet,
                        levels = c("all", "0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70+")))

# check
table(central_data2$facet, useNA = "always")
## 
##   all   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##   454    84    87    77    73    84    37     0     0    12

Notable changes to the ggplot command are:

  • The data used is now central_data2 (double the rows, with new column “facet”)
  • Labeller will need to be updated, if used
  • To achieve a long/thin plot, the facet variable is moved to the rows side of the formula, with the columns side replaced by “.” (facet_wrap(facet ~ .)), and ncol = 1 is set

You may also need to adjust the width and height of the saved plot image (see ggsave()).
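For instance, a tall-and-narrow export suits the ncol = 1 layout. This is only a sketch: the toy plot stands in for the real faceted epicurve, and the file name and dimensions are illustrative.

```r
library(ggplot2)

# toy stand-in for the faceted epicurve plot object
p <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) + geom_point()

# save tall and narrow to match a one-column facet layout
ggsave("epicurve_facets.png", plot = p,
       width = 6, height = 12, units = "in", dpi = 300)
```

If plot = is omitted, ggsave() saves the most recently displayed plot.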

ggplot(central_data2, aes(x = date_onset)) + 
  
  # actual epicurves by group
  geom_histogram(
        aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  # Labels on x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom")+               
  
  # create facets
  facet_wrap(facet~. ,                            # each plot is one value of facet
             ncol = 1)+            

  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 30 rows containing non-finite values (stat_bin).

Moving averages

Add a moving average to a ggplot() epicurve in one of two ways:

  1. Plot the pre-calculated moving average:
    • Aggregate the data as necessary (daily, weekly, etc.)
    • Calculate the moving average
    • Add the moving average to the ggplot (e.g. with geom_line())
  2. Calculate the moving average on-the-fly, within the ggplot() command

Using slider

In this approach, the moving average is calculated in the dataset prior to plotting:

  • Within mutate(), a new column is created to hold the average, using slide_index() from the slider package as shown below.
  • In the ggplot(), a geom_line() is added after the histogram, reflecting the moving average.

See the helpful online vignette for the slider package

pacman::p_load(slider)  # slider used to calculate rolling averages

# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>% 
  ## count cases by date
  count(date_onset,
        name = "new_cases") %>%   # name of new column
  filter(!is.na(date_onset)) %>%  # remove cases with missing date_onset
  
  ## calculate the average number of cases in the preceding 7 days
  mutate(
    avg_7day = slider::slide_index(    # create new column
      new_cases,                       # calculate based on value in new_cases column
      .i = date_onset,                 # index is date_onset col, so non-present dates are included in window 
      .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
      .before = 6,                     # window is the day and 6-days before
      .complete = FALSE),              # must be FALSE for unlist() to work in next step
    avg_7day = unlist(avg_7day))


# plot
######
ggplot(data = ll_counts_7day, aes(x = date_onset)) +
    geom_histogram(aes(y = new_cases),
                   fill="#92a8d1",
                   stat = "identity",
                   position = "stack",
                   colour = "#92a8d1")+ 
    geom_line(aes(y = avg_7day, lty = "7-day \nrolling avg"),
              color="red",
              size = 1) + 
    scale_x_date(date_breaks = "1 month",
                 date_labels = '%d/%m',
                 expand = c(0,0)) +
    scale_y_continuous(expand = c(0,0),
                       limits = c(0, NA)) + 
    labs(x="",
         y ="Number of confirmed cases",
         fill = "Legend")+ 
    theme_minimal()+
    theme(legend.title = element_blank())  # removes title of legend
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Using tidyquant

Using the tidyquant package to calculate the moving average on-the-fly (within ggplot()).

This option is more difficult to modify than pre-calculating the moving average. By default, geom_ma() uses the Simple Moving Average (SMA) (TTR::SMA()). See the documentation by entering ?SMA in your R console. It calculates the arithmetic mean over the past n observations. Also note how the moving average does not begin as early as in the previous example.

library(tidyquant)

# make daily count data
#######################
ll_counts_7day <- linelist %>% 
  count(date_onset, name = "daily_cases")


# plot
######
ggplot(data = ll_counts_7day,   # use daily count data
       aes(x = date_onset,      # date x-axis
           y = daily_cases))+   # counts
  
  # histogram in the background
  geom_histogram(stat = "identity",    # height = value in the cell, not number of rows
                 color = "#92a8d1",    # color of lines within histogram
                 fill = "#92a8d1")+    # color of histogram
  
  # moving average line
  tidyquant::geom_ma(n = 7,            # window width
                     size = 2,         # line size
                     color = "black",  # line color
                     lty = "solid"     # line type ()
                     )+
     
  # labels for x-axis
  scale_x_date(date_breaks = "2 months",      # labels every 2 months 
               date_minor_breaks = "1 month", # gridlines every month
               date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(x = "Date of onset", 
       y = "Daily case incidence",
       title = "Daily case incidence, with 7-day moving average")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 1 rows containing missing values (position_stack).

Tentative data

The most recent data shown in epicurves should often be marked as tentative, or subject to reporting delays. This can be done by adding a vertical line and/or rectangle over a specified number of days. Here are two options:

  1. Use annotate():
  • Pros: Transparency of rectangle is easy. Cons: Items will not appear in legend.
  • For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
  • For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
  2. Use geom_segment() and geom_rect():
  • Pros: Items can easily appear in legend. Cons: Difficult to achieve semi-transparency of rectangle.
  • Provide the same x/y arguments as noted above for annotate()

CAUTION: While you can use geom_rect() to draw a rectangle, adjusting the transparency (alpha) does not work in a linelist context. This function overlays a rectangle for each observation/row! Try a very low alpha (e.g. 0.01), or use annotate(geom = "rect") as shown.
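Both options hinge on the same cutoff date - 21 days before the most recent onset. A minimal base R sketch with illustrative dates (in the plots below this is max(central_data$date_onset, na.rm = T) - 21):

```r
# illustrative onset dates, including a missing value
onsets <- as.Date(c("2021-03-01", "2021-03-15", "2021-04-02", NA))

# cutoff: 21 days before the latest onset, ignoring missing dates
cutoff <- max(onsets, na.rm = TRUE) - 21

cutoff          # "2021-03-12"
class(cutoff)   # still "Date" - subtracting a number from a Date keeps the class
```

Because the result stays class Date, it can be passed directly to x =, xend =, or xintercept =; within annotate(geom = "rect"), wrap xmin/xmax in as.Date() as shown below.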

Using annotate()
  • Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
  • Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks
  • annotate() online example
ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "1 month",           # 1st of month
    date_minor_breaks = "1 month",     # 1st of month
    date_labels = "%b\n'%y")+          # label format
  
  # labels and theme
  labs(title = "Using annotate()\nRectangle and line showing that data from last 21-days are tentative",
    x = "Week of symptom onset",
    y = "Weekly case indicence")+ 
  theme_minimal()+
  
  # add semi-transparent red rectangle to tentative data
  annotate("rect",
           xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
           xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
           ymin  = 0,
           ymax  = Inf,
           alpha = 0.2,          # alpha easy and intuitive to adjust using annotate()
           fill  = "red")+
  
  # add black vertical line on top of other layers
  annotate("segment",
           x     = max(central_data$date_onset, na.rm = T) - 21, # 21 days before last data
           xend  = max(central_data$date_onset, na.rm = T) - 21, 
           y     = 0,         # line begins at y = 0
           yend  = Inf,       # line to top of plot
           size  = 2,         # line size
           color = "black",
           lty   = "solid")+   # linetype e.g. "solid", "dashed"

  # add text in rectangle
  annotate("text",
           x = max(central_data$date_onset, na.rm = T) - 15,
           y = 20,
           label = "Subject to reporting delays",
           angle = 90)
## Warning: Removed 15 rows containing non-finite values (stat_bin).

The same black vertical line can be achieved with the code below, but using geom_vline() you lose the ability to control the height:

geom_vline(xintercept = max(central_data$date_onset, na.rm = T) - 21,
           size = 2,
           color = "black")
Using geom_segment() and geom_rect()
ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  # labels and theme
  labs(title = "Using geom_segment() and geom_rect()\nRectangle and line showing that data from last 21-days are tentative",
    subtitle = "")+ 
  theme_minimal()+
  
  # make rectangle covering last 21 days
  geom_rect(aes(
              xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
              xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
              ymin  = 0,
              ymax  = Inf,
              color = "Reporting delays\npossible"),    # sets label for legend (note: is within aes())
              alpha = .002,                             # !!! Difficult to adjust transparency with this option
           fill  = "red")+
  
  # make vertical line
  geom_segment(aes(x = max(central_data$date_onset, na.rm = T) - 21,
                   xend = max(central_data$date_onset, na.rm = T) - 21,
                   y = 0,
                   yend = Inf),
               color = "black",
               lty = "solid",
               size = 2)+
  theme(legend.title = element_blank())                 # remove title of legend
## Warning: Use of `central_data$date_onset` is discouraged. Use `date_onset` instead.
## Warning: Removed 15 rows containing non-finite values (stat_bin).

Multi-level date labels

Here is an option if you want multi-level date labels, without duplicating the lower label levels (e.g. for year or month).

Remember, you can use tools like \n within the date_labels or labels arguments to put parts of each label on a new line below. However, the code below goes further: it places years or months (for example) on a lower line, shown only once each.

A few notes on the code below:

  • Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details.
  • A line is used instead of a histogram, as the faceting approach below does not work well with histograms.

Aggregate the weekly counts

# Create dataset of case counts by week
#######################################
central_weekly <- linelist %>%
  filter(hospital == "Central Hospital") %>%           # filter linelist
  mutate(week = lubridate::floor_date(date_onset, unit = "weeks")) %>%  
  count(week, .drop=F) %>%                             # summarize weekly case counts
  filter(!is.na(week)) %>%                             # remove cases with missing onset_date
  complete(week = seq.Date(from = min(week),           # fill-in all weeks with no cases reported
                           to   = max(week),
                           by   = "week"))

Make plots

# plot
######
ggplot(central_weekly) +
  geom_line(aes(x = week, y = n),    # make line, specify x and y
            stat = "identity") +             # because line height is count number
  scale_x_date(date_labels="%b",             # date label format show month 
               date_breaks="month",          # date labels on 1st of each month
               expand=c(0,0)) +              # remove excess space
  facet_grid(~lubridate::year(week), # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",                # x-axes adapt to data range (not "fixed")
             switch="x") +                   # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",         # facet labels placement
        strip.background = element_rect(fill = NA, # facet labels no fill grey border
                                        colour = "grey50"),
        panel.spacing = unit(0, "cm"))+      # no space between facet panels
  labs(title = "Nested year labels, grey label border")

# plot no border
################
ggplot(central_weekly,
       aes(x = week, y = n)) +              # establish x and y for entire plot
  geom_line(stat = "identity",              # make line, line height is count number
            color = "#69b3a2") +            # line color
  geom_point(size=1, color="#69b3a2") +     # make points at the weekly data points
  geom_area(fill = "#69b3a2",               # fill area below line
            alpha = 0.4)+                   # fill transparency
  scale_x_date(date_labels="%b",            # date label format show month 
               date_breaks="month",         # date labels on 1st of each month
               expand=c(0,0)) +             # remove excess space
  facet_grid(~lubridate::year(week),   # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",               # x-axes adapt to data range (not "fixed")
             switch="x") +                  # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",                     # facet label placement
          strip.background = element_blank(),            # no facet label background
          panel.grid.minor.x = element_blank(),          
          panel.border = element_rect(colour="grey40"),  # grey border to facet PANEL
          panel.spacing=unit(0,"cm"))+                   # No space between facet panels
  labs(title = "Nested year labels - points, shaded, no label border")
## Warning: Removed 5 rows containing missing values (position_stack).
## Warning: Removed 5 rows containing missing values (geom_point).

The above techniques were adapted from this and this post on stackoverflow.com.

Aggregating linelist data

To learn generally how to group and aggregate data, see the handbook page on Grouping/Aggregating.

In this circumstance, we demonstrate aggregating into weeks, months, and days.

Weeks

Create a new column containing the week, then use group_by() with summarize() to get weekly case counts.

To aggregate into weeks and show ALL weeks (even ones with no cases), do this:

  1. Create a new ‘week’ column within mutate(), using floor_date() from the lubridate package:
    • use unit = to set the desired time unit, e.g. "week"
    • use week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
  2. Follow with complete() to ensure that all weeks appear - even those with no cases.

For example:

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(
    week = lubridate::floor_date(date_onset,
                                 unit = "week")) %>%  # new column of week of onset
  count(week) %>%                                     # group data by week and count rows per group
  filter(!is.na(week)) %>%                            # remove entries for cases missing date_onset
  complete(week = seq.Date(from = min(week),          # fill-in all weeks with no cases reported
                           to = max(week),
                           by="week")) %>% 
  ungroup()                                           # deactivate grouping

Here are the first 50 rows of the resulting dataframe:
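If lubridate is unavailable, the week-flooring step can be mimicked in base R by subtracting the number of days elapsed since the week's Monday (a minimal sketch; as.POSIXlt()$wday codes Sunday as 0):

```r
# floor each date to the Monday of its week, without lubridate
floor_week <- function(d) d - (as.POSIXlt(d)$wday + 6) %% 7

d <- as.Date(c("2021-01-06", "2021-01-10"))  # a Wednesday and a Sunday
floor_week(d)                                # both floor to Monday 2021-01-04
```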

Alternatively, you can use the aweek package’s date2week() function. As shown below, set week_start = to “Sunday”, or “Monday”, etc. Set floor_day = TRUE so the output is the week (YYYY-Www) rather than a specific day. Set factor = TRUE so that all possible weeks are included, even if there are no cases (this replaces the complete() step in the lubridate approach above). You can also use numeric = TRUE if you want only the week number (note this will not distinguish between years).

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(week = aweek::date2week(date_onset,          # new column of week of onset
                                 floor_day = T,       # show as weeks without weekday
                                 factor = TRUE)) %>%  # include all possible weeks
  count(week) %>% 
  ungroup()                                           # deactivate grouping

# Optional: add column of start DATE for each week - e.g. for ggplot() when date x-axis is expected
# note: add this step AFTER the above code, to ensure all weeks are present
weekly_counts <- weekly_counts %>% 
  mutate(week_as_date = aweek::week2date(week, week_start = "Monday")) # output is Monday date of each week
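As a no-dependency alternative to date2week(), base R's format() can produce ISO 8601 "YYYY-Www" labels (Monday-start weeks) on platforms whose strftime supports %G and %V - though unlike factor = TRUE, this will not fill in empty weeks:

```r
# ISO week labels without aweek: %G = ISO year, %V = ISO week number (01-53)
d <- as.Date(c("2021-01-04", "2020-12-31"))
format(d, "%G-W%V")   # "2021-W01" "2020-W53" - 31 Dec 2020 falls in ISO week 53
```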
Months

To aggregate cases into months, again use floor_date() from the lubridate package, but with the argument unit = "months". This rounds each date down to the 1st of its month. The output will be class Date.

Note that in the complete() step we also use "months".

# Make dataset of monthly case counts
monthly_counts <- linelist %>% 
  mutate(month = lubridate::floor_date(date_onset, unit = "months")) %>%   # new column, 1st of month of onset
  count(month) %>% 
  filter(!is.na(month)) %>% 
  complete(month = seq.Date(min(month),     # fill-in all months with no cases reported
                            max(month),
                            by="month"))    
Days

To aggregate a linelist into days, use the same approach, but there is no need to create a new column: count() (or group_by()) can be applied directly to the date column (e.g. date_onset).

If plotting a histogram, missing days in the data are not a problem as long as the column is class Date. However, it may be important for other types of plots or tables to have all possible days appear in the data. This is done with tidyr::complete().

# Make dataset of daily case counts
daily_counts <- linelist %>% 
  count(date_onset) %>%                           # count number of rows per unique date
  filter(!is.na(date_onset)) %>%                  # remove aggregation of rows that were missing date_onset
  complete(date_onset = seq.Date(min(date_onset), # ensure all days appear
                                 max(date_onset),
                                 by="day"))  

Aggregated data

Often instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can still make an epicurve with ggplot(), but the code will be slightly different. Note that the incidence package does not accept aggregated data.

This section will utilize the count_data dataset that was imported earlier, in the data preparation section. It is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.

As before, we must ensure date variables are correctly classified.

# Check that the date variable is class Date
class(count_data$date_hospitalisation)
## [1] "Date"

We can plot a daily epicurve from these data. Here are the differences:

  • Specify y = to the counts column within the primary aesthetics aes()
  • Use of stat = "identity" within geom_histogram() indicates that bar heights should come from the y = column in aes(), rather than from counting rows
ggplot(data = count_data, aes(x = as.Date(date_hospitalisation), y = n_cases))+
     geom_histogram(stat = "identity")+
     labs(x = "Week of report", 
          y = "Number of cases",
          Title = "Daily case incidence, from daily count data")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Aggregate further

To aggregate further, into weeks, we use the lubridate function floor_date(), as described above. Note that we use group_by() and summarize() in place of count() because we need to sum() the case counts instead of just counting the number of rows per group.
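The sum-within-groups logic can also be checked with base R's aggregate() (illustrative counts; column names mirror those used below):

```r
# illustrative hospital-by-week counts
counts <- data.frame(hospital = c("A", "A", "B"),
                     epiweek  = as.Date("2021-01-04"),
                     n_cases  = c(3, 2, 7))

# sum n_cases within each hospital-week combination,
# as group_by() + summarize(sum()) does
weekly <- aggregate(n_cases ~ hospital + epiweek, data = counts, FUN = sum)

weekly$n_cases[weekly$hospital == "A"]   # 5
weekly$n_cases[weekly$hospital == "B"]   # 7
```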

# Create weekly dataset with epiweek column
count_data_weekly <- count_data %>%
  mutate(epiweek = lubridate::floor_date(date_hospitalisation, "week")) %>% 
  group_by(hospital, epiweek, .drop=F) %>% 
  summarize(n_cases_weekly = sum(n_cases, na.rm=T))   
## `summarise()` has grouped output by 'hospital'. You can override using the `.groups` argument.

The first 50 rows of count_data_weekly are displayed below.

For the plotting we also specify the factor level order of hospital.

count_data_weekly <- count_data_weekly %>% 
  mutate(hospital = factor(hospital,
                           levels = c("Missing", "Port Hospital", "Military Hospital", "Central Hospital", "St. Mark's Maternity Hospital (SMMH)", "Other")))

Now plot by epiweek.

ggplot(data = count_data_weekly,
       aes(x = epiweek,
           y = n_cases_weekly,
           group = hospital,
           fill = hospital))+
  
  geom_histogram(stat = "identity")+
     
  # labels for x-axis
  scale_x_date(date_breaks = "2 months",      # labels every 2 months 
               date_minor_breaks = "1 month", # gridlines every month
               date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(x = "Week of onset", 
       y = "Weekly case incidence",
       fill = "Hospital",
       title = "Weekly case incidence, from aggregated count data by hospital")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Dual-axis

Although the validity of dual axes is fiercely debated within the data visualization community, many supervisors want to see an epicurve or similar chart with a percentage series overlaid on a second axis.

In ggplot it is difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.

See the handbook page on ggplot tips for details on how to make a second axis.
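The core trick is that ggplot2's sec_axis() only re-labels the primary scale via a one-to-one transformation, so the percentage series must first be rescaled onto the count axis, and the secondary axis labelled with the inverse transform (e.g. sec_axis(~ . / scale_factor)). The arithmetic, with illustrative numbers:

```r
counts  <- c(10, 40, 25)   # weekly case counts (primary y-axis)
percent <- c(20, 55, 90)   # illustrative percentage series (secondary y-axis)

# rescale percent so its 0-100 range spans the count axis
scale_factor   <- max(counts) / 100
percent_scaled <- percent * scale_factor

percent_scaled   # 8, 22, 36 - plot these, then label the right axis as . / scale_factor
```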

Resources

Links to other online tutorials or resources.

Plotting continuous data

For appropriate plotting of continuous data, e.g. age, clinical measurements, distance, etc.

Overview

As usual, R has built-in functions for quick visualisations. You can opt to install additional packages with more functionality - this is often recommended for presentation-ready visualisations. Specifically, you can use:

  • the boxplot() function from the graphics package (installed automatically with base R)
  • the ggplot() function from the ggplot2 package

Visualisations covered here include:

  • Plots for one continuous variable:

    • Box plots (also called box-and-whisker plots), in which the box represents the 25th, 50th, and 75th percentiles of a continuous variable, the whiskers extending from the box represent the tails of the distribution, and dots represent outliers.
    • Violin plots, which are similar to histograms in that they show the distribution of a continuous variable, via the symmetrical width of the ‘violin’.
    • Jitter plots, which visualise the distribution of a continuous variable by showing all values as dots, rather than collectively as one larger shape. Each dot is ‘jittered’ so that they can all (mostly) be seen, even where two have the same value.
  • Scatter plots for two continuous variables.


Preparation

Preparation includes ensuring you have the correct packages (install.packages("ggplot2") if needed), and that your data are of the correct class and format.

Convert character columns to numeric as needed, for example age:

linelist <- linelist %>% 
  mutate(age = as.numeric(age))

Plotting with base graphics

In-built graphics package

Plotting one continuous variable

The in-built graphics package comes with the boxplot() function, allowing straightforward visualisation of a continuous variable for the whole dataset (A below) or within groups (B and C below). Note how in C, outcome and gender are written as outcome*gender, so that boxplots are shown for each combination of the two columns.

# For total population
graphics::boxplot(linelist$age,
                  main = "A) One boxplot() for total dataset") # Plot title


# By subgroup
graphics::boxplot(age ~ outcome,
                  data = linelist, # Here 'data' is specified so no need to write 'linelist$age' in line above.
                  main = "B) boxplot() by subgroup")

# By crossed subgroups
graphics::boxplot(age ~ outcome*gender,
                  data = linelist, # Here 'data' is specified so no need to write 'linelist$age' in line above.
                  main = "C) boxplot() by crossed groups")

Some further options with boxplot() shown below are:

  • Boxplot width proportional to sample size (A)
  • Notched boxplots, where the notch marks the median and its approximate confidence interval (B)
  • Horizontal (C)
# Varying width by sample size 
graphics::boxplot(linelist$age ~ linelist$outcome,
                  varwidth = TRUE, # width varying by sample size
                  main="A) Proportional boxplot() widths")

                  
# Notched boxplot
boxplot(age ~ outcome,
        data=linelist,
        notch=TRUE,      # notch at median
        main="B) Notched boxplot()",
        col=(c("gold","darkgreen")),
        xlab="Suppliment and Dose")

# Horizontal
boxplot(age ~ outcome,
        data=linelist,
        horizontal=TRUE,  # flip to horizontal
        col=(c("gold","darkgreen")),
        main="C) Horizontal boxplot()",
        xlab="Suppliment and Dose")

Plotting two continuous variables

Scatter plots are helpful for visualising the correlation between two continuous variables.

Using base R, they can simply be visualised with the plot() function, passing the two columns to plot against each other.

plot(linelist$age, linelist$wt_kg)

Plotting with ggplot()

Code syntax

ggplot2 has extensive functionality, and the same code syntax can be used for many different plot types.

A basic breakdown of the ggplot code is as follows:

ggplot(data = linelist,
       aes(x = col1, y = col2))+  
  geom_boxplot(fill = "color") 
  • ggplot() starts off the function. You can specify the data and aesthetics (see next point) within the ggplot bracket, unless you are combining different data sources or plot types into one
  • aes() stands for ‘aesthetics’, and is where the columns used for the visualisation are specified. For instance aes(x = col1, y = col2) to specify the data used for the x and y values (where y is the continuous variable in these examples).
  • fill specifies the colour of the boxplot areas. One could also write color to specify outline or point colour.
  • geom_XXX specifies what type of plot. Options include:
    • geom_boxplot() for a boxplot
    • geom_violin() for a violin plot
    • geom_jitter() for a jitter plot
    • geom_point() for a scatter plot

For more, see the section on ggplot tips.

Plotting one continuous variable

Below is code for creating box plots, for an entire dataset and by sub group. Note that for the subgroup breakdowns, the ‘NA’ values are also removed using dplyr, otherwise ggplot plots the age distribution for ‘NA’ as a separate boxplot.

# A) Simple boxplot of one numeric variable
ggplot(data = linelist, aes(y = age))+  # only y variable given (no x variable)
  geom_boxplot()+
  ggtitle("A) Simple ggplot() boxplot")
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).
# B) Box plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,         # numeric variable
           x = outcome)) +      # group variable
  geom_boxplot(fill = "gold")+   # create the boxplot and specify colour
  ggtitle("B) ggplot() boxplot by gender")      # main title
## Warning: Removed 61 rows containing non-finite values (stat_boxplot).

Below is code for creating violin plots (geom_violin) and jitter plots (geom_jitter). One can specify that the ‘fill’ or ‘color’ is also determined by the data, by placing these options within the aes() bracket.

# A) Violin plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,         # numeric variable
           x = outcome,      # group variable
           fill = outcome))+ # fill variable (color of boxes)
  geom_violin()+                            # create the violin plot
  ggtitle("A) ggplot() violin plot by gender")      # main title
## Warning: Removed 61 rows containing non-finite values (stat_ydensity).
# B) Jitter plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)), 
       aes(y = age,         # numeric variable
           x = outcome,      # group variable
           color = outcome))+ # Color variable
  geom_jitter()+                            # create the jitter plot
  ggtitle("B) ggplot() jitter plot by outcome")      # main title
## Warning: Removed 61 rows containing missing values (geom_point).

To examine further subgroups, one can ‘facet’ the graph. This means the plot will be recreated within each specified subgroup. One can use:

  • facet_wrap() - this will recreate the sub-graphs and present them alphabetically (typically, unless stated otherwise). You can invoke certain options to determine the look of the facets, e.g. nrow=1 or ncol=1 to control the number of rows or columns that the faceted plots are arranged within. See plot A below.
  • facet_grid() - this is suited to seeing subgroups for particular combinations of discrete variables. See plot B below.
# A) Facet by one variable
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
       aes(y = age, x = outcome, fill=outcome))+
  geom_boxplot()+
  ggtitle("A) A ggplot() boxplot by gender and outcome")+
  facet_wrap(~gender, nrow = 1)

# B) Facet across two variables
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
       aes(y = age))+
  geom_boxplot()+
  ggtitle("A) A ggplot() boxplot by gender and outcome")+
  facet_grid(outcome~gender)

To turn the plot horizontal, flip the coordinates with coord_flip().

# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
       aes(y = age, x = outcome, fill=outcome))+
  geom_boxplot()+
  ggtitle("B) A horizontal ggplot() boxplot by gender and outcome")+
  facet_wrap(gender~., ncol=1) + 
  coord_flip()

Plotting two continuous variables

Following similar syntax, geom_point() allows one to plot two continuous variables against each other in a scatter plot. Here we again use facet_grid() to show combinations of two discrete variables.

# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
       aes(y = wt_kg, x = age))+
  geom_point()+
  ggtitle("A horizontal ggplot() boxplot by gender and outcome")+
  facet_grid(gender~outcome) 

Resources

There is a huge amount of help online, especially for ggplot2. See:

Plotting discrete variables

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

base R

ggplot2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Tables

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Manually

From data frame

knitr::kable and DT

Summarizing dataframe

From model results

For publication

Other

Quickly changing the denominator (e.g. to per 100,000)
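As a minimal sketch of such a denominator change (the counts and population below are illustrative values, not from the handbook data):

```r
# Hypothetical example: convert a case count to a rate per 100,000 population
cases      <- 254        # case count (illustrative value)
population <- 3650000    # catchment population (illustrative value)

rate_per_100k <- cases / population * 100000   # cases per 100,000 population
round(rate_per_100k, 1)
```

The same multiplication works for any denominator (per 1,000, per 10,000, etc.) - only the scaling factor changes.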

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Age pyramids

Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics.
These tabs demonstrate how to produce age pyramids using:

  • Fast & easy: Using the apyramid package
  • More flexible: Using ggplot()
  • Having baseline demographics displayed in the background of the pyramid
  • Using pyramid-style plots to show other types of data (e.g responses to Likert-style questions)

Overview

Age/gender demographic pyramids in R are generally made with ggplot() by creating two barplots (one for each gender), converting one group’s values to negative, and flipping the x and y axes to display the barplots vertically.

This page covers:

  • A fast approach using the apyramid package
  • More customizable code using the raw ggplot() commands
  • How to combine case demographic data and compare them with a baseline population (as shown above)
  • Application of these methods to show other types of data (e.g. responses to Likert-style survey questions)

Preparation


For this tab we use the linelist dataset that is cleaned in the Cleaning tab.

To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:

  • The gender column must be cleaned.
  • Age should be in an age category column, and should be of class Factor (with correctly ordered levels)
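As a minimal base R sketch of that second step (with made-up ages; your column names and breaks may differ), cut() bins a numeric age column into a factor with correctly ordered levels:

```r
# Illustrative ages (not from the linelist)
ages <- c(2, 7, 23, 41, 67)

# cut() bins numeric ages into a factor whose levels are ordered automatically
age_cat <- cut(ages,
               breaks = c(0, 5, 10, 15, 20, 30, 50, 70, Inf),
               labels = c("0-4", "5-9", "10-14", "15-19",
                          "20-29", "30-49", "50-69", "70+"),
               right = FALSE)    # intervals closed on the left: [0,5), [5,10), ...

class(age_cat)    # "factor"
levels(age_cat)   # correctly ordered age groups
```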

Load packages

First, load the packages required for this analysis:

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
               apyramid,  # a package dedicated to creating age pyramids
               stringr)   # working with strings for titles, captions, etc.

Load the data

linelist <- rio::import("linelist_cleaned.csv")

Check class of variables

Ensure that the age variable is of class numeric, and check the class and order of levels of age_cat and age_cat5:

class(linelist$age_years)
## [1] "numeric"
class(linelist$age_cat)
## [1] "factor"
class(linelist$age_cat5)
## [1] "factor"
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1081  1148   971   837  1091   628    45     0    88
table(linelist$age_cat5, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84   85+  <NA> 
##  1081  1148   971   837   600   491   295   181    98    54    26    14     2     3     0     0     0     0    88

apyramid package


The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.

Linelist data


Using the cleaned linelist dataset, we can create an age pyramid with just one simple command. If you need help cleaning your data, see the handbook page on Cleaning data (LINK). In this command:

  • The data argument is set as the linelist dataframe
  • The age_group argument is set to the name (in quotes) of the categorical age column (in this case age_cat5)
  • The split_by argument (bar colors) should be a binary column (in this case “gender”)
apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender")
## Warning: 283 missing rows were removed (88 values from `age_cat5` and 283 values from `gender`).

When using the apyramid package, if the split_by column is binary (e.g. male/female, or yes/no), the result will appear as a pyramid. However, if there are more than two values in the split_by column (not including NA), the pyramid will appear as a faceted barplot, with empty bars in the background indicating the range of the un-faceted data set for each age group. Values of split_by will appear as labels at the top of each facet. For example, below the split_by variable is “hospital”.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "hospital",
                      na.rm = FALSE)        # show bars for patients missing age or hospital

Missing values
Rows missing values for the split_by or age_group columns, if coded as NA, will not trigger the faceting shown above. By default these rows will not be shown. However you can specify that they appear, in an adjacent barplot and as a separate age group at the top, by specifying na.rm = FALSE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      na.rm = FALSE)         # show patients missing age or gender

Proportions, colors, & aesthetics

By default, the bars display counts (not %), a dashed mid-line is shown for each group, and the colors are green/purple. Each of these parameters can be adjusted, as shown below:

You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE,                  # show percents, not counts
                      show_midpoint = FALSE,                # remove bar mid-point line
                      #pal = c("orange", "purple")          # can specify alt. colors here (but not labels, see below)
                      )+                 
  
  # additional ggplot commands
  theme_minimal()+                                          # simplify the background
  scale_fill_manual(values = c("orange", "purple"),         # to specify colors AND labels
                     labels = c("Male", "Female"))+
  labs(y = "Percent of all cases",                          # note that x and y labels are switched (see ggplot tab)
       x = "Age categories",                          
       fill = "Gender", 
       caption = "My data source and caption here",
       title = "Title of my plot",
       subtitle = "Subtitle with \n a second line...")+
  theme(
    legend.position = "bottom",                             # move legend to bottom
    axis.text = element_text(size = 10, face = "bold"),     # fonts/sizes, see ggplot tips page
    axis.title = element_text(size = 12, face = "bold"))
## Warning: 283 missing rows were removed (88 values from `age_cat5` and 283 values from `gender`).
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.

Aggregated data


The examples above assume your data are in a linelist-like format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.

Let’s say that your dataset looks like this, with columns for age category, and male counts, female counts, and missing counts.
(see the handbook page on Transforming data for tips)

# View the aggregated data
DT::datatable(demo_agg, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

ggplot() prefers data in “long” format, so first pivot the data to be “long” with the pivot_longer() function from tidyr.

# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>% 
  pivot_longer(c(f, m, missing_gender),            # cols to elongate
               names_to = "gender",                # name for new col of categories
               values_to = "counts") %>%           # name for new col of counts
  mutate(gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA
# View the aggregated data
DT::datatable(demo_agg_long, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Then use the split_by and count arguments of age_pyramid() to specify the respective columns:

apyramid::age_pyramid(data = demo_agg_long,
                      age_group = "age_cat5",
                      split_by = "gender",
                      count = "counts")      # give the column name for the aggregated counts
## Warning: Removed 19 rows containing missing values (position_stack).
## Warning: Removed 19 rows containing missing values.

Note in the above that the factor order of “m” and “f” is different (pyramid reversed). To adjust the order you must re-define gender in the aggregated data as a Factor, and order the levels as desired.
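A minimal base R sketch of that re-ordering, using a toy data frame (the column names and values are illustrative; apply the same idea to your aggregated data):

```r
# Toy aggregated data (illustrative values only)
agg <- data.frame(
  age_cat5 = c("0-4", "0-4", "5-9", "5-9"),
  gender   = c("m", "f", "m", "f"),
  counts   = c(10, 12, 8, 9)
)

# Re-define gender as a factor with the desired level order
agg$gender <- factor(agg$gender, levels = c("m", "f"))
levels(agg$gender)   # "m" is now the first level, "f" the second
```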

ggplot()


Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.

apyramid uses ggplot() in the background (and accepts ggplot() commands added), but this page shows how to adjust or recreate a pyramid only using ggplot(), if you wish.

Constructing the plot


First, understand that to make such a pyramid using ggplot() the approach is to:

  • Within the ggplot(), create two graphs by age category - one for each of the two grouping values (in this case gender). See the filters applied to the data arguments in each of the geom_histogram() commands below.

  • If using geom_histogram(), the graphs operate off the numeric column (e.g. age_years), whereas if using geom_bar() the graphs operate from an ordered Factor (e.g. age_cat5).

  • One graph will have positive count values, while the other will have its counts converted to negative values - this allows both graphs to be seen and compared against each other in the same plot.

  • The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.

  • Lastly, the counts-axis labels must be specified so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).

A simple version of this, using geom_histogram(), is below:

  # begin ggplot
  ggplot(data = linelist, aes(x = age, fill = gender)) +
  
  # female histogram
  geom_histogram(data = filter(linelist, gender == "f"),
                 breaks = seq(0,85,5),
                 colour = "white") +
  
  # male histogram (values converted to negative)
  geom_histogram(data = filter(linelist, gender == "m"),
                 breaks = seq(0,85,5),
                 aes(y=..count..*(-1)),
                 colour = "white") +
  
  # flip the X and Y axes
  coord_flip() +
  
  # adjust counts-axis scale
  scale_y_continuous(limits = c(-600, 900),
                     breaks = seq(-600,900,100),
                     labels = abs(seq(-600, 900, 100)))

DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.

There are many things you can change/add to this simple version, including:

  • Auto-adjust the counts-axis scale to your data (to avoid the errors discussed in the warning above)
  • Manually specify colors and legend labels
# create dataset with proportion of total
pyramid_data <- linelist %>%
  group_by(age_cat5, gender) %>% 
  summarize(counts = n()) %>% 
  ungroup() %>% 
  mutate(percent = round(100*(counts / sum(counts, na.rm=T)),1), 
         percent = case_when(
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)


# begin ggplot
  ggplot()+  # begin with an empty plot; data are provided in geom_bar() below

  # case data graph
  geom_bar(data = pyramid_data,
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        # fill bars by gender
           colour = "white")+         # white around each bar
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  

  # adjust the axes scales (remember they are flipped now!)
  #scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
  scale_y_continuous(limits = c(min_per, max_per),
                     breaks = seq(floor(min_per), ceiling(max_per), 2),
                     labels = paste0(abs(seq(floor(min_per), ceiling(max_per), 2)), "%"))+

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",
               "m" = "darkgreen"),
    labels = c("Female", "Male"),
  ) +
  
  # label values (remember X and Y flipped now)
  labs(
    x = "Age group",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust=0, size=11, face = "italic")) + 
  
  ggtitle(paste0("Age and gender of cases"))
## Warning: Removed 12 rows containing missing values (position_stack).

Compare to baseline


With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.

Import and view the population data

# import the population demographics data
pop <- rio::import("country_demographics.csv")
# display the population data as a table
DT::datatable(pop, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First some data management steps:

Here we record the order of age categories that we want to appear. Due to some quirks in the way ggplot() is implemented, it is easiest to store these as a character vector and use them later in the plotting function.

# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
                "25-29","30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74",
                "75-79", "80-84", "85+")

Combine the population and case data through the dplyr function bind_rows():

  • First, ensure they have the exact same column names, age categories values, and gender values
  • Make them have the same data structure: columns of age category, gender, counts, and percent of total
  • Bind them together, one on-top of the other (bind_rows())
# create/transform populaton data, with percent of total
########################################################
pop_data <- pivot_longer(pop, c(m, f), names_to = "gender", values_to = "counts") %>% # pivot gender columns longer
  mutate(data = "population",                                                         # add column designating data source
         percent  = round(100*(counts / sum(counts, na.rm=T)),1),                     # calculate % of total
         percent  = case_when(                                                        # if male, convert % to negative
                            gender == "f" ~ percent,
                            gender == "m" ~ -percent,
                            TRUE          ~ NA_real_))

Review the changed population dataset

# display the population data as a table
DT::datatable(pop_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now implement the same for the case linelist. This is slightly different because it begins with case-rows, not counts.

# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
  group_by(age_cat5, gender) %>%  # aggregate linelist cases into age-gender groups
  summarize(counts = n()) %>%     # calculate counts per age-gender group
  ungroup() %>% 
  mutate(data = "cases",                                          # add column designating data source
         percent = round(100*(counts / sum(counts, na.rm=T)),1),  # calculate % of total for age-gender groups
         percent = case_when(                                     # convert % to negative if male
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.

Review the changed case dataset

# display the case data as a table
DT::datatable(case_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now the two datasets are combined, one on top of the other (same column names)

# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows(case_data, pop_data)

Store the maximum and minimum percent values, used in the plotting function to define the extent of the plot (and not cut off any bars!)

# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

Now the plot is made with ggplot():

  • One bar graph of population data (wider, more transparent bars)
  • One bar graph of case data (narrower, more solid bars)
# begin ggplot
##############
ggplot()+  # begin with an empty plot; data are provided in the geoms below

  # population data graph
  geom_bar(data = filter(pyramid_data, data == "population"),
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        
           colour = "black",                               # black color around bars
           alpha = 0.2,                                    # more transparent
           width = 1)+                                     # full width
  
  # case data graph
  geom_bar(data = filter(pyramid_data, data == "cases"), 
           stat = "identity",                              # use % as given in data, not counting rows
           aes(x = age_cat5,                               # age categories as original X axis
               y = percent,                                # % as original Y-axis
               fill = gender),                             # fill of bars by gender
           colour = "black",                               # black color around bars
           alpha = 1,                                      # not transparent 
           width = 0.3)+                                   # narrower bars
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  
  # adjust axes order, scale, and labels (remember X and Y axes are flipped now)
  # manually ensure that age-axis is ordered correctly
  scale_x_discrete(limits = age_levels)+ 
  
  # set percent-axis 
  scale_y_continuous(limits = c(min_per, max_per),                                          # min and max defined above
                     breaks = seq(floor(min_per), ceiling(max_per), by = 2),                # from min% to max% by 2 
                     labels = paste0(                                                       # for the labels, paste together... 
                       abs(seq(floor(min_per), ceiling(max_per), by = 2)),                  # ...rounded absolute values of breaks... 
                       "%"))+                                                               # ... with "%"
                                                                                            # floor(), ceiling() round down and up 

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",         # assign colors to values in the data
               "m" = "darkgreen"),
    labels = c("f" = "Female",
               "m"= "Male"),      # change labels that appear in legend, note order
  ) +

  # plot labels, titles, caption    
  labs(
    title = "Case age and gender distribution,\nas compared to baseline population",
    subtitle = "",
    x = "Age category",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    legend.position = "bottom",                             # move legend to bottom
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0), 
    plot.caption = element_text(hjust=0, size=11, face = "italic"))
## Warning: Removed 12 rows containing missing values (position_stack).

Likert scale


The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.

Import the data

# import the likert survey response data
likert_data <- rio::import("likert_data.csv")

Start with data that look like this: a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very Poor”, “Poor”, “Good”, “Very Good”).

# display the likert data as a table
DT::datatable(likert_data, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First, some data management steps:

  • Pivot the data longer
  • Create a new column direction depending on whether the response was generally “positive” or “negative”
  • Set the Factor level order for the status column and the Response column
  • Store the max count value so limits of plot are appropriate
melted <- pivot_longer(likert_data, Q1:Q8, names_to = "Question", values_to = "Response") %>% 
     mutate(direction = case_when(
               Response %in% c("Poor","Very Poor") ~ "Negative",
               Response %in% c("Good", "Very Good") ~ "Positive",
               TRUE ~ "Unknown"),
            status = factor(status, levels = rev(c(
                 "Senior", "Intermediate", "Junior"))),
            Response = factor(Response, levels = c("Very Good", "Good",
                                             "Very Poor", "Poor"))) # must reverse Very Poor and Poor for ordering to work

melted_max <- melted %>% 
   group_by(status, Question) %>% 
   summarize(n = n())
## `summarise()` has grouped output by 'status'. You can override using the `.groups` argument.
melted_max <- max(melted_max$n, na.rm=T)
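The plot below uses melted_max to set axis limits that round the maximum count outward, so no bar is cut off. A simplified sketch of that arithmetic (using an illustrative count of 33; the actual code also pads the negative limit slightly):

```r
# Round the maximum count outward to the nearest 10
melted_max <- 33                          # illustrative maximum count
upper <- ceiling(melted_max / 10) * 10    # rounds 33 up to 40
lower <- -upper                           # mirror for the negative side
seq(lower, upper, 10)                     # axis breaks from negative to positive
```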

Now make the plot:

# make plot
ggplot()+
     # bar graph of the "negative" responses 
     geom_bar(data = filter(melted,
                            direction == "Negative"), 
              aes(x = status,
                        y=..count..*(-1),    # counts inverted to negative
                        fill = Response),
                    color = "black",
                    position = "stack")+
     
     # bar graph of the "positive" responses
     geom_bar(data = filter(melted, direction == "Positive"),
              aes(x = status, fill = Response),
              colour = "black",
              position = "stack")+
     
     # flip the X and Y axes
     coord_flip()+
  
     # Black vertical line at 0
     geom_hline(yintercept = 0, color = "black", size=1)+
     
    # convert labels to all positive numbers
    scale_y_continuous(limits = c(-ceiling(melted_max/10)*11, ceiling(melted_max/10)*10),   # limits rounded outward to nearest 10, with extra padding on the negative side
                       breaks = seq(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10, 10),
                       labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
                                            seq(0, ceiling(melted_max/10)*10, 10))))) +
     
    # color scales manually assigned 
    scale_fill_manual(values = c("Very Good"  = "green4", # assigns colors
                                  "Good"      = "green3",
                                  "Poor"      = "yellow",
                                  "Very Poor" = "red3"),
                       breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
     
    
     
    # facet the entire plot so each question is a sub-plot
    facet_wrap(~Question, ncol = 3)+
     
    # labels, titles, caption
    labs(x = "Respondent status",
          y = "Number of responses",
          fill = "")+
     ggtitle(str_glue("Likert-style responses\nn = {nrow(likert_data)}"))+

     # aesthetic settings
     theme_minimal()+
     theme(axis.text = element_text(size = 12),
           axis.title = element_text(size = 14, face = "bold"),
           strip.text = element_text(size = 14, face = "bold"),  # facet sub-titles
           plot.title = element_text(size = 20, face = "bold"),
           panel.background = element_rect(fill = NA, color = "black")) # black box around each facet

Resources


This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Diagrams

Overview

This page covers:

  • Flow diagrams using DiagrammeR
  • Alluvial/Sankey diagrams
  • Event timelines
  • Dendrogram organizational trees (e.g. of folder contents)
  • DAGs (Directed Acyclic Graphs)

Preparation


Load packages

pacman::p_load(
  DiagrammeR,     # for flow diagrams
  networkD3       # For alluvial/Sankey diagrams
  )

Flow diagrams


One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.

Tools

The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.

Basic structure

  1. Open the instructions grViz("
  2. Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {
  3. Graph statement (layout, rank direction)
  4. Nodes statements (create nodes)
  5. Edges statements (gives links between nodes)
  6. Close the instructions }")

Simple examples


Below are two simple examples

A very minimal example:

# A minimal plot
DiagrammeR::grViz("digraph {
  
graph[layout = dot, rankdir = LR]

a
b
c

a -> b -> c
}")

An example with applied public health context:

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,
         overlap = true,
         fontsize = 10]
  
  # nodes
  #######
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]               # width of circles
  
  Primary                         # names of nodes
  Secondary
  Tertiary

  # edges
  #######
  Primary   -> Secondary [label = 'case transfer']
  Secondary -> Tertiary [label = 'case transfer']
}
")

Syntax


Basic syntax

Node names, or edge statements, can be separated with spaces, semicolons, or newlines.

Rank direction

A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.

Node names

Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (’ ’). It may be easier to have a short node name, and assign a label, as shown below within brackets [ ]. A label is also necessary to have a newline within the node name - use \n in the node label within single quotes, as shown below.

Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
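For instance, inside the grViz() string, this DOT sketch uses the shorthand (node names a, b, c are illustrative):

```
digraph {
  {a b} -> c    // shorthand for writing both edges: a -> c and b -> c
}
```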

Layouts

  • dot (set rankdir to either TB, LR, RL, or BT)
  • neato
  • twopi
  • circo

Nodes - editable attributes

  • label (text, in single quotes if multi-word)
  • fillcolor (many possible colors)
  • fontcolor
  • alpha (transparency 0-1)
  • shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
  • style
  • sides
  • peripheries
  • fixedsize (h x w)
  • height
  • width
  • distortion
  • penwidth (width of shape border)
  • x (displacement left/right)
  • y (displacement up/down)
  • fontname
  • fontsize
  • icon

Edges - editable attributes

  • arrowsize
  • arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
  • arrowtail
  • dir (direction)
  • style (dashed, …)
  • color
  • alpha
  • headport (text in front of arrowhead)
  • tailport (text in behind arrowtail)
  • fontname
  • fontsize
  • fontcolor
  • penwidth (width of arrow)
  • minlen (minimum length)

Color names: hexadecimal values or ‘X11’ color names, see here for X11 details

Complex examples


The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            # layout top-to-bottom
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      
  
  Primary   [label = 'Primary\nFacility'] 
  Secondary [label = 'Secondary\nFacility'] 
  Tertiary  [label = 'Tertiary\nFacility'] 
  SC        [label = 'Surveillance\nCoordination',
             fontcolor = darkgreen] 
  
  # edges
  #######
  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red,
                          color = red]
  
  # grouped edge
  {Primary Secondary Tertiary} -> SC [label = 'case reporting',
                                      fontcolor = darkgreen,
                                      color = darkgreen,
                                      style = dashed]
}
")

Sub-graph clusters

To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have the subgraph identified within a box, begin the name with “cluster” as shown below.

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            
         overlap = true,
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,                  # shape = circle
       fixedsize = true
       width = 1.3]                      # width of circles
  
  subgraph cluster_passive {
    Primary   [label = 'Primary\nFacility'] 
    Secondary [label = 'Secondary\nFacility'] 
    Tertiary  [label = 'Tertiary\nFacility'] 
    SC        [label = 'Surveillance\nCoordination',
               fontcolor = darkgreen] 
  }
  
  # nodes (boxes)
  ###############
  node [shape = box,                     # node shape
        fontname = Helvetica]            # text font in node
  
  subgraph cluster_active {
    Active [label = 'Active\nSurveillance']; 
    HCF_active [label = 'HCF\nActive Search']
  }
  
  subgraph cluster_EBD {
    EBS [label = 'Event-Based\nSurveillance (EBS)']; 
    'Social Media'
    Radio
  }
  
  subgraph cluster_CBS {
    CBS [label = 'Community-Based\nSurveillance (CBS)'];
    RECOs
  }

  
  # edges
  #######
  {Primary Secondary Tertiary} -> SC [label = 'case reporting']

  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red]
  
  HCF_active -> Active
  
  {'Social Media'; Radio} -> EBS
  
  RECOs -> CBS
}
")

node shapes

The example below, borrowed from this tutorial, demonstrates applied node shapes and a shorthand for defining serial edge connections

DiagrammeR::grViz("digraph {

graph [layout = dot, rankdir = LR]

# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]

data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label =  'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']

# edge definitions with the node IDs
{data1 data2}  -> process -> statistical -> results
}")

Outputs

How to handle and save outputs

  • Outputs will appear in RStudio’s Viewer pane, by default in the lower-right alongside Files, Plots, Packages, and Help.
  • To export, use “Save as image” or “Copy to clipboard” from the Viewer. The graphic will adjust to the size you specify.
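
You can also save a diagram programmatically. One option - a sketch, assuming the DiagrammeRsvg and rsvg packages are installed - is to convert the graph to SVG and render it to a PNG file:

```r
pacman::p_load(DiagrammeR, DiagrammeRsvg, rsvg)

g <- grViz("digraph {A -> B}")               # any DiagrammeR graph object

svg_text <- DiagrammeRsvg::export_svg(g)     # convert the graph to SVG text
rsvg::rsvg_png(charToRaw(svg_text),          # render the SVG to a PNG file
               "my_diagram.png")
```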

Parameterized figures

“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is a unique numeric index. Here is a basic example:”
https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/

# Define some sample data
data <- list(a=1000, b=800, c=600, d=400)


DiagrammeR::grViz("
digraph graph2 {

graph [layout = dot]

# node definitions with substituted label text
node [shape = rectangle, width = 4, fillcolor = Beige]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']

a -> b -> c -> d

}

[1]:  paste0('Raw Data (n = ', data$a, ')')
[2]: paste0('Remove Errors (n = ', data$b, ')')
[3]: paste0('Identify Potential Customers (n = ', data$c, ')')
[4]: paste0('Select Top Priorities (n = ', data$d, ')')
")

Much of the above is adapted from the tutorial at this site

Other more in-depth tutorial: http://rich-iannone.github.io/DiagrammeR/

CONSORT diagram

https://scriptsandstatistics.wordpress.com/2017/12/22/how-to-draw-a-consort-flow-diagram-using-r-and-graphviz/

Note that the tutorial above is out-of-date with the current version of DiagrammeR

Alluvial/Sankey Diagrams

Preparation

Load packages

pacman::p_load(networkD3)

Plotting from dataset

Plotting the connections in a dataset

https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html

Counts by age category and hospital, re-labeled as target and source, respectively:

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>% 
  rename(source = hospital,
         target = age_cat)

Now formalize the nodes list, and adjust the ID columns to be numbers instead of labels:

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# convert names to zero-based numeric IDs (networkD3 uses zero-indexing)
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

Now plot the Sankey diagram:

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",    # units shown in the tooltip
                   fontSize = 12,
                   nodeWidth = 30)
p

Here is an example where the patient outcome is included as well. Note in the data management step how we bind rows of counts of hospital -> outcome, using the same column names.

# counts by hospital and age category
links <- linelist %>% 
  select(hospital, age_cat) %>%
  mutate(age_cat = stringr::str_glue("Age {age_cat}")) %>% 
  count(hospital, age_cat) %>% 
  rename(source = age_cat,
         target = hospital) %>% 
  bind_rows(
    linelist %>% 
      select(hospital, outcome) %>% 
      count(hospital, outcome) %>% 
      rename(source = hospital,
             target = outcome)
  )

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# convert names to zero-based numeric IDs (networkD3 uses zero-indexing)
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",    # units shown in the tooltip
                   fontSize = 12,
                   nodeWidth = 30)
p

https://www.displayr.com/sankey-diagrams-r/

Event timelines

To make a timeline showing specific events, you can use the vistime package.

See this vignette

# load package
pacman::p_load(vistime,  # make the timeline
               plotly    # for interactive visualization
               )

Here is the events dataset we begin with:
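
The dataset itself is not shown here; a hypothetical example, in the default column format that vistime expects (event, start, end, group), could be defined like this:

```r
# hypothetical example events data, using vistime's default column names
data <- data.frame(
  event = c("Onset", "Hospitalization", "Recovery"),
  start = as.Date(c("2020-03-01", "2020-03-05", "2020-03-20")),
  end   = as.Date(c("2020-03-01", "2020-03-10", "2020-03-20")),
  group = c("Case 1", "Case 1", "Case 1")
)
```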

p <- vistime(data)    # apply vistime

library(plotly)

# step 1: transform into a list
pp <- plotly_build(p)

# step 2: Marker size
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "markers") pp$x$data[[i]]$marker$size <- 10
}

# step 3: text size
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textfont$size <- 10
}


# step 4: text position
for(i in 1:length(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textposition <- "right"
}

#print
pp

DAGs

You can build a DAG manually using the DiagrammeR package and DOT language, as described in another tab. Alternatively, there are dedicated packages such as ggdag and dagitty:

https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-dags.html

https://www.r-bloggers.com/2019/08/causal-inference-with-dags-in-r/#:~:text=In%20a%20DAG%20all%20the,for%20drawing%20and%20analyzing%20DAGs.
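
As a minimal sketch (assuming the ggdag package is installed, and using hypothetical variable names), a DAG can be defined with dagify() and plotted with ggdag():

```r
pacman::p_load(ggdag)

# a hypothetical DAG: exposure and confounder both affect outcome,
# and the confounder also affects the exposure
dag <- dagify(outcome  ~ exposure + confounder,
              exposure ~ confounder)

ggdag(dag) +       # plot the DAG
  theme_dag()      # remove axes and gridlines
```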

Resources

Links to other online tutorials or resources.

Combination analysis

Overview

This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency of symptom combinations.

This analysis is often called:

  • Multiple response analysis
  • Sets analysis
  • Combinations analysis

The first method shown uses the package ggupset, and the second uses the package UpSetR.

An example plot is below. Five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.

Preparation

pacman::p_load(tidyverse,
               UpSetR,
               ggupset)

View the data

This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot.

View the data (scroll to the right to see the symptoms variables)

Re-format values

We convert each “yes” value to the actual symptom name. If the value is “no”, we set it to missing (NA).

# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into the symptom name itself
  mutate(fever = case_when(fever == "yes" ~ "fever",          # if old value is "yes", new value is "fever"
                           TRUE           ~ NA_character_),   # if old value is anything other than "yes", the new value is NA
         
         chills = case_when(chills == "yes" ~ "chills",
                           TRUE           ~ NA_character_),
         
         cough = case_when(cough == "yes" ~ "cough",
                           TRUE           ~ NA_character_),
         
         aches = case_when(aches == "yes" ~ "aches",
                           TRUE           ~ NA_character_),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ "shortness_of_breath",
                           TRUE           ~ NA_character_))

Now we make two final variables:
  1. A character variable pasting together all of the patient’s symptoms
  2. A copy of that variable converted to class list, so it can be accepted by ggupset to make the plot

linelist_sym_1 <- linelist_sym_1 %>% 
  mutate(
         # combine the variables into one, using paste() with a semicolon separating any values
         all_symptoms = paste(fever, chills, cough, aches, shortness_of_breath, sep = "; "),
         
         # make a copy of all_symptoms variable, but of class "list" (which is required to use ggupset() in next step)
         all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
         )

View the new data. Note the two columns at the end - the pasted combined values, and the list

DT::datatable(linelist_sym_1, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))

ggupset

Load required package to make the plot (ggupset)

pacman::p_load(ggupset)

Create the plot:

ggplot(linelist_sym_1,
       aes(x=all_symptoms_list)) +
geom_bar() +
scale_x_upset(reverse = FALSE,
              n_intersections = 10,
              sets = c("fever", "chills", "cough", "aches", "shortness_of_breath")
              )+
  labs(title = "Signs & symptoms",
       subtitle = "10 most frequent combinations of signs and symptoms",
       caption = "Caption here.",
       x = "Symptom combination",
       y = "Frequency in dataset")
## Warning: Removed 720 rows containing non-finite values (stat_count).

More information on ggupset can be found online or offline in the package documentation in your RStudio Help tab.

UpSetR

The UpSetR package allows more customization, but is more difficult to execute:

  • https://github.com/hms-dbmi/UpSetR - read this first
  • https://gehlenborglab.shinyapps.io/upsetr/ - a Shiny App version, where you can upload your own data
  • https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf - the package documentation, which can be difficult to interpret

pacman::p_load(UpSetR)

Convert the symptom variables to 1/0.

# Make using upSetR

linelist_sym_2 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into 1s and 0s
  mutate(fever = case_when(fever == "yes" ~ 1,          # if old value is "yes", new value is 1
                           TRUE           ~ 0),   # if old value is anything other than "yes", the new value is 0
         
         chills = case_when(chills == "yes" ~ 1,
                           TRUE           ~ 0),
         
         cough = case_when(cough == "yes" ~ 1,
                           TRUE           ~ 0),
         
         aches = case_when(aches == "yes" ~ 1,
                           TRUE           ~ 0),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ 1,
                           TRUE           ~ 0))

Now make the plot, using only the symptom variables. You must designate which “sets” to compare (the names of the symptom variables).
Alternatively, use nsets = and order.by = "freq" to show only the top X combinations.

# Make the plot
UpSetR::upset(
  select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
  sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"),
  order.by = "freq",
  sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
  empty.intersections = "on",
  # nsets = 3,
  number.angles = 0,
  point.size = 3.5,
  line.size = 2, 
  mainbar.y.label = "Symptoms Combinations",
  sets.x.label = "Patients with Symptom")
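
For example, a shorter variant - a sketch, to be adjusted to your own data - that shows only the three largest sets, ordered by frequency:

```r
UpSetR::upset(
  select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
  nsets = 3,            # consider only the 3 largest sets
  order.by = "freq")    # order bars by frequency
```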

Resources

Links to other online tutorials or resources.

Heatmaps & density plots

Overview

Heatmaps can be useful for tracking reporting metrics across many facilities or jurisdictions over time.

For example, the image below shows the percent of days per week that data were received from each facility, week-by-week:

# Create weekly summary dataset
###############################
agg_weeks <- facility_count_data %>% 
  
  # filter the data as appropriate
  filter(District == "Spring",
         data_date < as.Date("2019-06-01")) %>% 
  
  # Create week column from data_date
  mutate(week = aweek::date2week(data_date,
                                 start_date = "Monday",
                                 floor_day = TRUE,
                                 factor = TRUE)) %>% 
  # Group into facility-weeks
  group_by(location_name, week, .drop = F) %>%
  
  # Create summary column on the grouped data
  summarize(n_days          = 7,                                          # 7 days per week           
            n_reports       = dplyr::n(),                                 # number of reports received per week (could be >7)
            malaria_tot     = sum(malaria_tot, na.rm = T),                # total malaria cases reported
            n_days_reported = length(unique(data_date)),                  # number of unique days reporting per week
            p_days_reported = round(100*(n_days_reported / n_days))) %>%  # percent of days reporting
  
  # Ensure every possible facility-week combination appears in the data
  right_join(tidyr::expand(., week, location_name))    # "." represents the dataset at that moment in the pipe chain
## `summarise()` has grouped output by 'location_name'. You can override using the `.groups` argument.
## Joining, by = c("location_name", "week")
# METRICS PLOT
##############
ggplot(agg_weeks,
       aes(x = aweek::week2date(week),            # transformed to date class
           y = location_name,
           fill = p_days_reported))+
  # tiles
  geom_tile(colour="white")+                      # white gridlines
  
  scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
  scale_x_date(expand = c(0,0),
               date_breaks = "2 weeks",
               date_labels = "%d\n%b")+
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),         # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),       # width of legend key
    
    axis.text.x = element_text(size=12),
    axis.text.y = element_text(vjust=0.2),
    axis.ticks = element_line(size=0.4),
    axis.title = element_text(size=12, face="bold"),
    
    plot.title = element_text(hjust=0,size=14,face="bold"),
    plot.caption = element_text(hjust = 0, face = "italic")
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)", # legend title
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

################
# DENSITY MAP
################
pacman::p_load(OpenStreetMap)

# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)),  # limits of tile
               c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
               zoom = NULL,
               type = c("osm", "stamen-toner", "stamen-terrain","stamen-watercolor", "esri","esri-topo")[1],
               mergeTiles = TRUE)
# Projection WGS84
map.latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

# Plot map. Must be autoplotted to work with ggplot
OpenStreetMap::autoplot.OpenStreetMap(map.latlon)+
  # Density tiles  
  ggplot2::stat_density_2d(aes(x = lon,
        y = lat,
        fill = ..level..,
        alpha =..level..),
    bins = 10,
    geom = "polygon",
    contour_var = "count",
    data = linelist,
    show.legend = F) +
  scale_fill_gradient(low = "black", high = "red")+
  labs(x = "Longitude",
       y = "Latitude",
       caption = "OpenStreetMap base tile",
       title = "Distribution density of simulated cases")

Preparation

pacman::p_load(OpenStreetMap,
               aweek)

Reporting metrics over time

Often in public health, an objective is to assess trends over time for many entities (facilities, jurisdictions, etc.). One way to visualize these trends is a heatmap, where the x-axis shows time and the y-axis lists the many entities.
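
As a minimal sketch with made-up data, the core pattern is simply geom_tile() with time on the x-axis, the entity on the y-axis, and the metric as the fill:

```r
pacman::p_load(tidyverse)

# made-up example data: 3 facilities over 4 weeks, with a random reporting metric
toy <- tidyr::expand_grid(
    facility = c("Facility A", "Facility B", "Facility C"),
    week     = as.Date("2021-01-04") + (0:3) * 7) %>%
  mutate(pct_reported = runif(n(), 0, 100))

ggplot(toy, aes(x = week, y = facility, fill = pct_reported)) +
  geom_tile(colour = "white") +                             # white gridlines
  scale_fill_gradient(low = "orange", high = "darkgreen")
```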

Preparation

To demonstrate this, we import this dataset of daily malaria case reports from 65 facilities.

The preparation will involve:

  • Importing and reviewing the data
  • Aggregating the daily data into weekly, and summarizing weekly performance

Load and view

Below are the first 30 rows of these data:

Packages

The packages we will use are:

pacman::p_load(tidyverse, # ggplot and data manipulation
               rio,       # importing data
               aweek)     # manage weeks

Aggregate and summarize

The objective is to transform the daily reports (seen in previous tab) into weekly reports with a summary of performance - in this case the proportion of days per week that the facility reported any data for Spring District from April-May 2019.

To achieve this:

  1. Filter the data as appropriate (by place, date)
  2. Create a week column using date2week() from package aweek
    • This function transforms dates to weeks, using a specified start date of each week (e.g. “Monday”)
    • The floor_day = argument means that dates are rounded into the week only (day of the week is not shown)
    • The factor = argument converts the new column to a factor - important because all possible weeks within the date range are designated as levels, even if there is no data for them currently.
  3. The data are grouped by columns “location” and “week” to create analysis units of “facility-week”
  4. The verb summarize() creates new columns to calculate reporting performance for each “facility-week”:
    • Number of days per week (7 - a static value)
    • Number of reports received from the facility-week (could be more than 7!)
    • Sum of malaria cases reported by the facility-week (just for interest)
    • Number of unique days in the facility-week for which there is data reported
    • Percent of the 7 days per facility-week for which data was reported
  5. The dataframe is joined (right_join()) to a comprehensive list of all possible facility-week combinations, to make the dataset complete. The matrix of all possible combinations is created by applying expand() to those two columns of the dataframe as it is at that moment in the pipe chain (represented by “.”). Because a right_join() is used, all rows in the expand() dataframe are kept, and added to agg_weeks if necessary. These new rows appear with NA (missing) summarized values.
# Create weekly summary dataset
agg_weeks <- facility_count_data %>% 
  
  # filter the data as appropriate
  filter(District == "Spring",
         data_date < as.Date("2019-06-01")) %>% 
  
  # Create week column from data_date
  mutate(week = aweek::date2week(data_date,
                                 start_date = "Monday",
                                 floor_day = TRUE,
                                 factor = TRUE)) %>% 
  # Group into facility-weeks
  group_by(location_name, week, .drop = F) %>%
  
  # Create summary column on the grouped data
  summarize(n_days          = 7,                                          # 7 days per week           
            n_reports       = dplyr::n(),                                 # number of reports received per week (could be >7)
            malaria_tot     = sum(malaria_tot, na.rm = T),                # total malaria cases reported
            n_days_reported = length(unique(data_date)),                  # number of unique days reporting per week
            p_days_reported = round(100*(n_days_reported / n_days))) %>%  # percent of days reporting
  
  # Ensure every possible facility-week combination appears in the data
  right_join(tidyr::expand(., week, location_name))    # "." represents the dataset at that moment in the pipe chain
## `summarise()` has grouped output by 'location_name'. You can override using the `.groups` argument.
## Joining, by = c("location_name", "week")
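
To see the completeness pattern of step 5 in isolation - a sketch with made-up data - expand() builds every combination of the two columns, and right_join() adds the missing combinations as NA rows:

```r
pacman::p_load(dplyr, tidyr)

# made-up data: facility B has no row for week 2
obs <- tibble::tribble(
  ~facility, ~week, ~n_reports,
  "A",       1,     5,
  "A",       2,     6,
  "B",       1,     3)

complete_obs <- obs %>%
  right_join(tidyr::expand(., facility, week))  # adds B / week 2, with NA n_reports
```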

Create heatmap

The plot is made with ggplot() using geom_tile():

  • The week values on the x-axis are transformed to dates, allowing use of scale_x_date()
  • location_name on the y-axis will show all facility names
  • The fill is the performance for that facility-week (numeric)
  • scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA
  • scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format
  • Aesthetic themes and labels can be adjusted as necessary

Basic

ggplot(agg_weeks,
       aes(x = aweek::week2date(week),            # transformed to date class
           y = location_name,
           fill = p_days_reported))+
  # tiles
  geom_tile(colour="white")+                      # white gridlines
  
  scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
  scale_x_date(expand = c(0,0),
               date_breaks = "2 weeks",
               date_labels = "%d\n%b")+
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),         # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),       # width of legend key
    
    axis.text.x = element_text(size=12),
    axis.text.y = element_text(vjust=0.2),
    axis.ticks = element_line(size=0.4),
    axis.title = element_text(size=12, face="bold"),
    
    plot.title = element_text(hjust=0,size=14,face="bold"),
    plot.caption = element_text(hjust = 0, face = "italic")
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)", # legend title
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

Ordered y-axis

If you want to order the y-axis facilities, convert them to class factor and provide the order. Below, the order is set by the total number of reporting days filed by each facility across the whole timespan:

facility_order <- agg_weeks %>% 
  group_by(location_name) %>% 
  summarize(tot_reports = sum(n_days_reported, na.rm=T)) %>% 
  arrange(tot_reports) # ascending order
as_tibble(facility_order)
## # A tibble: 15 x 2
##    location_name tot_reports
##    <chr>               <int>
##  1 Facility 56             1
##  2 Facility 65             6
##  3 Facility 11            19
##  4 Facility 39            31
##  5 Facility 59            33
##  6 Facility 27            40
##  7 Facility 32            41
##  8 Facility 51            41
##  9 Facility 7             42
## 10 Facility 1             46
## 11 Facility 9             48
## 12 Facility 35            50
## 13 Facility 50            51
## 14 Facility 58            53
## 15 Facility 28            75

Now use the above vector (facility_order$location_name) to set the order of the factor levels of location_name in the dataset agg_weeks:

agg_weeks <- agg_weeks %>% 
  mutate(location_name = factor(location_name, levels = facility_order$location_name))
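As an alternative sketch, forcats::fct_reorder() can set the factor levels by a summary statistic in one step, without building a separate ordering dataframe. The toy data below stand in for agg_weeks.

```r
library(dplyr)
library(forcats)

agg_toy <- tibble::tibble(
  location_name   = rep(c("Facility A", "Facility B"), each = 2),
  n_days_reported = c(1, 2, 5, 6)      # totals: A = 3, B = 11
)

# Order facility levels by their total reporting days (ascending by default)
agg_toy <- agg_toy %>%
  mutate(location_name = fct_reorder(location_name, n_days_reported,
                                     .fun = sum, na.rm = TRUE))

levels(agg_toy$location_name)   # facility with the lowest total comes first
```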

And now the data are re-plotted, with location_name being an ordered factor:

ggplot(agg_weeks,
       aes(x = aweek::week2date(week),            # transformed to date class
           y = location_name,
           fill = p_days_reported))+
  # tiles
  geom_tile(colour="white")+                      # white gridlines

  scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
  scale_x_date(expand = c(0,0),
               date_breaks = "2 weeks",
               date_labels = "%d\n%b")+
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),         # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),       # width of legend key
    
    axis.text.x = element_text(size=12),
    axis.text.y = element_text(vjust=0.2),
    axis.ticks = element_line(size=0.4),
    axis.title = element_text(size=12, face="bold"),
    
    plot.title = element_text(hjust=0,size=14,face="bold"),
    plot.caption = element_text(hjust = 0, face = "italic")
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)", # legend title
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

Display values

You can add a geom_text() layer on top of the tiles, to display the actual numbers of each tile. Be aware this may not look pretty if you have many small tiles!

  • Note that the following code adds geom_text(aes(label = p_days_reported))+. Within aes() of geom_text(), the argument label (the text to display) is set to the same numeric column used to create the color gradient.
ggplot(agg_weeks,
       aes(x = aweek::week2date(week),            # transformed to date class
           y = location_name,
           fill = p_days_reported))+
  # tiles
  geom_tile(colour="white")+                      # white gridlines
  
  geom_text(aes(label = p_days_reported))+          # add text on top of tile
  
  scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
  scale_x_date(expand = c(0,0),
               date_breaks = "2 weeks",
               date_labels = "%d\n%b")+
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),         # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),       # width of legend key
    
    axis.text.x = element_text(size=12),
    axis.text.y = element_text(vjust=0.2),
    axis.ticks = element_line(size=0.4),
    axis.title = element_text(size=12, face="bold"),
    
    plot.title = element_text(hjust=0,size=14,face="bold"),
    plot.caption = element_text(hjust = 0, face = "italic")
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)", # legend title
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

GIS heatmaps

Contoured heatmap of cases over a basemap

  1. Create basemap tile from OpenStreetMap
  2. Plot the cases from linelist using the latitude and longitude

http://data-analytics.net/cep/Schedule_files/geospatial.html

pacman::p_load(OpenStreetMap)

# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)),  # limits of tile
               c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
               zoom = NULL,
               type = c("osm", "stamen-toner", "stamen-terrain","stamen-watercolor", "esri","esri-topo")[1],
               mergeTiles = TRUE)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj = prefer_proj): Discarded ellps WGS 84 in CRS definition:
## +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj = prefer_proj): Discarded datum World Geodetic System 1984
## in CRS definition
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj = prefer_proj): Discarded ellps WGS 84 in CRS definition:
## +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj = prefer_proj): Discarded datum World Geodetic System 1984
## in CRS definition
# Projection WGS84
map.latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

# Plot map. Must be autoplotted to work with ggplot
OpenStreetMap::autoplot.OpenStreetMap(map.latlon)+
  # Density tiles  
  ggplot2::stat_density_2d(aes(x = lon,
        y = lat,
        fill = ..level..,
        alpha=..level..),
    bins = 10,
    geom = "polygon",
    contour_var = "count",
    data = linelist,
    show.legend = F) +
  scale_fill_gradient(low = "black", high = "red")+
  labs(x = "Longitude",
       y = "Latitude",
       title = "Distribution of simulated cases")

https://www.rdocumentation.org/packages/OpenStreetMap/versions/0.3.4/topics/autoplot.OpenStreetMap

Resources

Links to other online tutorials and resources.

Transmission Chains

Overview

The primary tool to visualize and analyze transmission chains is the package epicontacts, developed by the folks at RECON.


Preparation
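A minimal preparation sketch: the mers_korea_2015 example data ship with the outbreaks package, and epicontacts provides the network functions used below.

```r
# Load packages (installing any that are missing)
pacman::p_load(outbreaks, epicontacts)

# mers_korea_2015 is a list holding a case linelist and a contacts table
names(mers_korea_2015)
```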

Visualization

links <- epicontacts::make_epicontacts(linelist = mers_korea_2015$linelist,
                                       contacts = mers_korea_2015$contacts, 
                                       directed = TRUE)
## Warning in epicontacts::make_epicontacts(linelist = mers_korea_2015$linelist, : Cycle(s) detected in the contact network: this may be
## unwanted
# plot without time
plot(links,
     selector = FALSE,
     height = 700,
     width = 700)

And in a transmission tree, with date of onset on the x-axis:

Note: this currently requires installing a development version of epicontacts from GitHub.

# plot with date of onset as x-axis
plot(sim,
     x_axis = 'onset',
     height = 700,
     width = 700)
## Warning in vis_temporal_interactive(x, x_axis = x_axis, node_color = node_color, : 14 nodes and 14 edges removed as x_axis data is
## unavailable

Analysis

summary(links)
## 
## /// Overview //
##   // number of unique IDs in linelist: 162
##   // number of unique IDs in contacts: 97
##   // number of unique IDs in both: 97
##   // number of contacts: 98
##   // contacts with both cases in linelist: 100 %
## 
## /// Degrees of the network //
##   // in-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6049  1.0000  3.0000 
## 
##   // out-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.6049  0.0000 38.0000 
## 
##   // in and out degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    1.21    1.00   39.00 
## 
## /// Attributes //
##   // attributes in linelist:
##  age age_class sex place_infect reporting_ctry loc_hosp dt_onset dt_report week_report dt_start_exp dt_end_exp dt_diag outcome dt_death
## 
##   // attributes in contacts:
##  exposure diff_dt_onset

Resources

Links to other online tutorials and resources.

Phylogenetic trees

Overview

Phylogenetic trees are used to visualize and describe the relatedness and evolution of organisms based on their genetic sequences. They can be constructed using distance-based methods (such as neighbor-joining) or character-based methods (such as maximum likelihood and Bayesian Markov chain Monte Carlo methods).

Next-generation sequencing (NGS) has become more affordable and is increasingly used in public health to characterize the pathogens causing infectious diseases. Portable sequencing devices decrease turnaround time and can support outbreak investigation in real time. NGS data can be used to identify the origin or source of an outbreak strain and its propagation, as well as to detect antimicrobial resistance genes.

To visualize the genetic relatedness between samples, a phylogenetic tree is constructed. In this page we will learn how to use the ggtree package, which allows phylogenetic trees to be combined with additional sample data in the form of a dataframe, helping to reveal patterns and improve understanding of the outbreak dynamics.

Preparation

This code chunk shows the loading of required packages:

# First we load the pacman package:
library(pacman)

# This allows us to load multiple packages at the same time in one line of code:
pacman::p_load(here, ggplot2, dplyr, ape, ggtree, treeio, ggnewscale)

There are several formats in which a phylogenetic tree can be stored (e.g. Newick, NEXUS, Phylip). A common one, which we use in this example, is the Newick file format (.nwk), the standard for representing trees in computer-readable form. An entire tree can be expressed as a string such as “((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);”, listing all nodes and tips and their relationships (branch lengths) to each other.

It is important to understand that the phylogenetic tree file in itself does not contain sequencing data, but is merely the result of the distances between the sequences. We therefore cannot extract sequencing data from a tree file.
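As a quick illustration, the example Newick string above can be parsed directly with ape, without writing it to a file first:

```r
library(ape)

# read.tree() accepts a Newick string via its `text` argument
tr <- read.tree(text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);")

Ntip(tr)        # number of tips (5)
tr$tip.label    # tip names as written in the string
```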

We use the ape package to import a phylogenetic tree file and store it in a list object of class “phylo”. Inspecting our tree object below, we see it contains 299 tips (or samples) and 236 internal nodes.

# read in the tree: we use the here package to specify the location of our R project and data files:
tree <- ape::read.tree(here::here("data", "Shigella_tree.nwk"))

# inspect the tree file:
tree
## 
## Phylogenetic tree with 299 tips and 236 internal nodes.
## 
## Tip labels:
##   SRR5006072, SRR4192106, S18BD07865, S18BD00489, S17BD08906, S17BD05939, ...
## Node labels:
##   17, 29, 100, 67, 100, 100, ...
## 
## Rooted; includes branch lengths.

Second, we import a table with additional information for each sequenced sample, such as gender, country of origin, and antimicrobial resistance attributes:

# We read in a csv file into a dataframe format:
sample_data <- read.csv(here::here("data","sample_data_Shigella_tree.csv"),sep=",", na.strings=c("NA"), head = TRUE, stringsAsFactors=F)

We clean and inspect our data: In order to assign the correct sample data to the phylogenetic tree, the Sample_IDs in the sample_data file need to match the tip.labels in the tree file:

# We clean the data: we select certain columns to be protected from cleaning in order to maintain their formatting (e.g. the sample names, as they have to match the names in the phylogenetic tree file)
#sample_data <- linelist::clean_data(sample_data, protect = c(1, 3:5)) 

# We check the formatting of the tip labels in the tree file: 

head(tree$tip.label) # these are the sample names in the tree - we inspect the first 6 with head()
## [1] "SRR5006072" "SRR4192106" "S18BD07865" "S18BD00489" "S17BD08906" "S17BD05939"
# We make sure the first column in our dataframe are the Sample_IDs:
colnames(sample_data)   
##  [1] "Sample_ID"                  "serotype"                   "Country"                    "Continent"                 
##  [5] "Travel_history"             "Year"                       "Patient_age"                "Source"                    
##  [9] "Gender"                     "gyrA_mutations"             "macrolide_resistance_genes" "ESBL"                      
## [13] "MIC_AZM"                    "MIC_CIP"
# We look at the sample_IDs in the dataframe to make sure the formatting is the same as in the tip.labels (e.g. letters all capital, no extra _ between letters and numbers, etc.)
head(sample_data$Sample_ID) # we inspect only the first 6 using head()
## [1] "ERR025692" "ERR025682" "ERR025714" "ERR025713" "ERR025709" "ERR025711"

Upon inspection we can see that the format of sample_ID in the dataframe corresponds to the format of sample names at the tree tips. These do not have to be sorted in the same order to be matched.
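A quick base-R sketch for checking that tip labels and sample IDs actually match before plotting (unmatched IDs would silently get no annotation). The vectors below are toy stand-ins for tree$tip.label and sample_data$Sample_ID.

```r
tip_labels <- c("SRR5006072", "SRR4192106", "S18BD07865")
sample_ids <- c("SRR5006072", "S18BD07865", "S18BD00489")

setdiff(tip_labels, sample_ids)   # tips with no matching sample data
setdiff(sample_ids, tip_labels)   # sample rows not present in the tree
```

If either setdiff() returns a non-empty vector, inspect those IDs for formatting differences before joining the data to the tree.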

We are ready to go!

Simple tree visualization

Different tree layouts:

ggtree() offers many different layout formats and some may be more suitable for your specific purpose than others:

# Examples:
ggtree(tree) # simplest linear tree
ggtree(tree, branch.length = "none") # simplest linear tree with all tips aligned
ggtree(tree, layout = "circular") # simplest circular tree
ggtree(tree, layout = "circular", branch.length = "none") # simplest circular tree with all tips aligned

# for other options see online: http://yulab-smu.top/treedata-book/chapter4.html

Simple tree with addition of sample data:

The easiest annotation of your tree is the addition of sample names at the tips, as well as coloring of tip points and, if desired, branches:

# A: Plot Circular tree:
ggtree(tree, layout="circular", branch.length='none') %<+% sample_data + # the %<+% is used to add your dataframe with sample data to the tree
  aes(color=I(Source))+ # color the branches according to a variable in your dataframe
  scale_color_manual(name = "Sample Origin", # name of your color scheme (will show up in the legend like this)
                     breaks = c("NRC BEL", "NA"), # the different options in your variable
                     labels = c("NRCSS Belgium", ""), # how you want the different options named in your legend, allows for formatting
                     values= c("blue"), # the color assigned when the value is "NRC BEL"
                     na.value="grey")+ # for the NA values we choose the color grey
  new_scale_color()+ # allows to add an additional color scheme for another variable
     geom_tippoint(aes(color=Continent), size=1.5)+ # color the tip point by continent, you may change shape adding "shape = "
scale_color_brewer(name = "Continent",  # name of your color scheme (will show up in the legend like this)
                       palette="Set1", # we choose a premade set of colors coming with the brewer package
                   na.value="grey")+ # for the NA values we choose the color grey
  geom_tiplab(color='black', offset = 1, size = 1, geom = "text" , align=TRUE)+ # add the name of the sample to the tip of its branch (you can add as many text lines as you like with the + , you just need to change the offset value to place them next to each other)
  ggtitle("Phylogenetic tree of Shigella sonnei")+ # title of your graph
  theme(axis.title.x=element_blank(), # removes x-axis title
      axis.title.y=element_blank(), # removes y-axis title
     legend.title=element_text(face="bold", size =12), # defines font size and format of the legend title
       legend.text=element_text(face="bold", size =10), # defines font size and format of the legend text
      plot.title = element_text(size =12, face="bold"),  # defines font size and format of the plot title
     legend.position="bottom", # defines placement of the legend
        legend.box="vertical", legend.margin=margin()) # defines placement of the legend
## Warning: Duplicated aesthetics after name standardisation: size

## Warning: Duplicated aesthetics after name standardisation: size

# Export your tree graph:
ggsave(here::here("example_tree_circular_1.png"), width = 12, height = 14)

Manipulation of phylogenetic trees

Sometimes you may have a very large phylogenetic tree but be interested in only one part of it. For example, you may have produced a tree including historical or international samples to see where your dataset fits into the bigger picture, but now want to inspect only your portion of that larger tree more closely.

Since the phylogenetic tree file is just the output of sequencing data analysis, we cannot manipulate the order of the nodes and branches in the file itself; these were determined in the earlier analysis of the raw NGS data. We can, however, zoom into parts, collapse parts, and even subset parts of the tree.

Zooming in on one part of the tree:

If you don’t want to “cut” your tree, but only inspect part of it more closely you can zoom in to view a specific part:

# First we plot the whole tree:
p <- ggtree(tree,) %<+% sample_data +
  geom_tiplab(size =1.5) + # labels the tips of all branches with the sample name from the tree file
  geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust=1, vjust =1) # labels all the nodes in the tree
p

We want to zoom into the branch which is sticking out, after node number 452 to get a closer look:

viewClade(p , node=452)

Collapsing one part of the tree:

Conversely, we may want to ignore the branch that is sticking out, which we can do by collapsing it at the node (indicated here by the blue square):

#First we collapse at node 452
p_collapsed <- collapse(p, node=452)

#To not forget that we collapsed this node we assign a symbol to it:
p_collapsed + geom_point2(aes(subset=(node == 452)), size=5, shape=23, fill="steelblue")
## Warning: Ignoring unknown aesthetics: subset
## Warning: Removed 83 rows containing missing values (geom_point).

Subsetting a tree:

If we want to make a more permanent change and create a new tree to work with we can subset part of it and even save it as new newick tree file.

# To do so you can add the node and tip labels to your tree to see which part you want to subset:
ggtree(tree, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =1) + # labels the tips of all branches with the sample name in the tree file
  geom_text2(aes(subset=!isTip, label=node), size =3, color = "darkred") +# labels all the nodes in the tree
 theme(legend.position = "none", # removes the legend all together
 axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      plot.title = element_text(size =12, face="bold"))

# A: Subset tree based on node:
sub_tree1 <- tree_subset(tree, node = 528) # we subset the tree at node 528
# lets have a look at the subset tree:
ggtree(sub_tree1)+  geom_tiplab(size =3) +
  ggtitle("Subset tree 1")

# B: Subset the same part of the tree based on a sample, in this case S17BD07692:
sub_tree2 <- tree_subset(tree,"S17BD07692", levels_back = 9) # levels back defines how many nodes backwards from the sample tip you want to go
# lets have a look at the subset tree:
ggtree(sub_tree2)+  geom_tiplab(size =3)  +
  ggtitle("Subset tree 2")

You can also save your new tree as a Newick file:

ape::write.tree(sub_tree2, file='Shigelle_subtree_2.nwk')

Rotating nodes in a tree:

As mentioned before, we cannot change the order of tips or nodes in the tree, as this is based on their genetic relatedness and is not subject to visual manipulation. But we can rotate branches around nodes if that eases our visualization.

First we plot our new subsetted tree with nodelabels to choose the node we want to manipulate:

p <- ggtree(sub_tree2) +  geom_tiplab(size =4) +
  geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust =1, vjust =1) # labels all the nodes in the tree
p

We choose to manipulate node number 39: applying ggtree::rotate() or ggtree::flip() so that node 39 moves to the bottom and nodes 37 and 38 move to the top:

# 
# p1 <- p + geom_hilight(39, "steelblue", extend =0.0015)+ # highlights the node 39 in blue
#    geom_hilight(37, "yellow", extend =0.0015)  + # highlights the node 37 in yellow
#   ggtitle("Original tree")
# 
# # we want to rotate node 36 so node 39 is on the bottom and nodes 37 and 38 move to the top:
# 
# rotate(p1, 39) %>% rotate(37)+
#   ggtitle("Rotated Node 36")
# 
# #or we can use the flip command to achieve the same thing:
# flip(p1, 39, 37)

Example subtree with sample data annotation:

Let’s say we are investigating the cluster of cases with clonal expansion which occurred in 2017 and 2018 at node 39 in our sub-tree. We add the year of strain isolation as well as travel history, and color by country, to see the origin of other closely related strains:

# Add sample data:
ggtree(sub_tree2) %<+% sample_data + 
   geom_tiplab(size =2.5, offset = 0.001, align = TRUE) + # labels the tips of all branche with the sample name in the tree file
  theme_tree2()+
  xlab("genetic distance")+ # add a label to the x-axis
  xlim(0, 0.015)+ # set the x-axis limits of our tree
  geom_tippoint(aes(color=Country), size=1.5)+ # color the tip point by continent
  scale_color_brewer(name = "Country", 
                       palette="Set1", 
                     na.value="grey")+
    geom_tiplab(aes(label = Year), color='blue', offset = 0.0045, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add isolation year
    geom_tiplab(aes(label = Travel_history), color='red', offset = 0.006, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add travel history
  ggtitle("Phylogenetic tree of Belgian S. sonnei strains with travel history")+ # add plot title
  theme(axis.title.x=element_blank(),
      axis.title.y=element_blank(),
     legend.title=element_text(face="bold", size =12),
       legend.text=element_text(face="bold", size =10),
      plot.title = element_text(size =12, face="bold"))
## Warning: Duplicated aesthetics after name standardisation: size

## Warning: Duplicated aesthetics after name standardisation: size

## Warning: Duplicated aesthetics after name standardisation: size

Our observation points towards an import of strains from Asia, which then circulated in Belgium over the years and seem to have caused our latest outbreak.

More complex trees: adding heatmaps of sample data

We can add more complex information, such as the categorical presence of antimicrobial resistance genes and numeric values of laboratory-measured resistance to antimicrobials, in the form of a heatmap using the ggtree::gheatmap() function.

First we need to plot our tree (this can be either linear or circular). We will use sub_tree2 from the previous section:

# A: Circular tree:
p <- ggtree(sub_tree2, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =3) + 
 theme(legend.position = "none",
 axis.title.x=element_blank(),
      axis.title.y=element_blank(),
      plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))
p

Second we prepare our data. To visualize different variables with new color schemes, we subset our dataframe to the desired variable.

For example we want to look at gender and mutations that could confer resistance to ciprofloxacin:

# Create your gender dataframe:
gender <- data.frame("gender" = sample_data[,c("Gender")])
# Its important to add the Sample_ID as rownames otherwise it cannot match the data to the tree tip.labels:
rownames(gender) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on mutations in the gyrA gene:
cipR <- data.frame("cipR" = sample_data[,c("gyrA_mutations")])
rownames(cipR) <- sample_data$Sample_ID

# Create your ciprofloxacin dataframe based on the measured minimum inhibitory concentration (MIC) from the laboratory:
MIC_Cip <- data.frame("mic_cip" = sample_data[,c("MIC_CIP")])
rownames(MIC_Cip) <- sample_data$Sample_ID

We create a first plot adding a binary heatmap for gender to the phylogenetic tree:

# First we add gender:
h1 <-  gheatmap(p, gender, offset = 10, width=0.10, color=NULL, # offset shifts the heatmap to the right, width defines the width of the heatmap column, color defines the border of the heatmap columns
         colnames = FALSE)+ # hides column names for the heatmap
  scale_fill_manual(name = "Gender", # define the coloring scheme and legend for gender
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h1

Then we add information on ciprofloxacin resistance genes:

# First we assign a new color scheme to our existing plot; this enables us to define and change the colors for our second variable
h2 <- h1 + new_scale_fill() 

# then we combine these into a new plot:
h3 <- gheatmap(h2, cipR,  offset = 12, width=0.10, # adds the second row of heatmap describing ciprofloxacin resistance genes
                colnames = FALSE)+
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())+
  guides(fill=guide_legend(nrow=2,byrow=TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h3

Next we add continuous data on actual resistance determined by the laboratory, as the minimum inhibitory concentration (MIC) of ciprofloxacin:

# First we add the new coloring scheme:
h4 <- h3 + new_scale_fill()

# then we combine the two into a new plot:
h5 <- gheatmap(h4, MIC_Cip,  offset = 14, width=0.10,
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for ciprofloxacin",
                      low = "yellow", high = "red",
                      breaks = c(0, 0.50, 1.00),
                      na.value = "white")+
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5

We can do the same exercise for a linear tree:

# B: Linear tree:
p <- ggtree(sub_tree2) %<+% sample_data +
  geom_tiplab(size = 3) + # labels the tips
  theme_tree2() +
  xlab("genetic distance") +
  xlim(0, 0.015) +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        plot.title = element_text(size = 12, face = "bold", hjust = 0.5, vjust = -15))


# First we add gender:

h1 <-  gheatmap(p, gender, offset = 0.003, width=0.1, color="black", 
         colnames = FALSE)+
  scale_fill_manual(name = "Gender",
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
# h1

# Then we add ciprofloxacin after adding another color scheme layer:

h2 <- h1 + new_scale_fill()
h3 <- gheatmap(h2, cipR,  offset = 0.004, width=0.1,color="black",
                colnames = FALSE)+
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l"))+
   theme(legend.position="bottom",
        legend.title = element_text(size=12),
        legend.text = element_text(size =10),
        legend.box="vertical", legend.margin=margin())+
  guides(fill=guide_legend(nrow=2,byrow=TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
# h3

# Then we add the minimum inhibitory concentration determined by the lab (MIC):
h4 <- h3 + new_scale_fill()
h5 <- gheatmap(h4, MIC_Cip, offset = 0.005, width=0.1, color="black", 
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for ciprofloxacin",
                      low = "yellow", high = "red",
                      breaks = c(0,0.50,1.00),
                      na.value = "white")+
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position="bottom",
        legend.title = element_text(size=10),
        legend.text = element_text(size =8),
        legend.box="horizontal", legend.margin=margin())+
  guides(shape = guide_legend(override.aes = list(size = 2)))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5

Interactive plots

Audiences increasingly expect to be able to interrogate data visualisations, so creating interactive plots is becoming common. There are several ways to build these, but the two most important are {plotly} and {shiny}.

{shiny} is covered in another part of this handbook, so we will only cover {plotly} here. #TODO - link to shiny page

Overview

Making plots interactive can sound more difficult than it turns out to be, thanks to some fantastic tools.

In this section, you’ll learn how to easily make a plot interactive with the wonders of {ggplot2} and {plotly}.


Preparation

In the example above, you saw a very basic epicurve that had been made interactive using the fantastic {ggplot2} - {plotly} integration. To start, make a basic chart of your own:

Loading data

linelist <- rio::import("linelist_cleaned.xlsx")

Manipulate and add columns (best taught in the epicurves section)

linelist <- linelist %>% 
  dplyr::mutate(
    ## If the outcome column is NA, change to "Unknown"
    outcome = dplyr::if_else(condition = is.na(outcome),
                             true = "Unknown",
                             false = outcome),
    ## If the date of infection is NA, use the date of onset instead
    date_earliest = dplyr::if_else(condition = is.na(date_infection),
                                   true = date_onset,
                                   false = date_infection),
    ## Summarise earliest date to earliest week 
    week_earliest = lubridate::floor_date(x = date_earliest,
                                          unit = "week",
                                          week_start = 1)
    )

Count for plotting

## Find number of cases in each week by their outcome
linelist <- linelist %>% 
  dplyr::count(week_earliest, outcome)

Plot

Make into a plot

p <- linelist %>% 
  ggplot()+
  geom_col(aes(week_earliest, n, fill = outcome))+
  xlab("Week of infection/onset") + ylab("Cases per week")+
  theme_minimal()

Make interactive

p <- p %>% 
  plotly::ggplotly()

Voila!

p
## Warning: Removed 3 rows containing missing values (position_stack).

Modifications

When exporting to an R Markdown generated HTML (like this book!), you want to make the plot as small as possible (with no negative side effects, in most cases). For this, just add this line:

p <- p %>% 
  plotly::partial_bundle()

Some of the buttons on a standard plotly plot (as shown on the preparation tab) are superfluous and can be distracting, so it’s best to remove them. You can do this simply by piping the output into plotly::config():

## these buttons are superfluous/distracting
plotly_buttons_remove <- list('zoom2d','pan2d','lasso2d', 'select2d','zoomIn2d',
                              'zoomOut2d','autoScale2d','hoverClosestCartesian',
                              'toggleSpikelines','hoverCompareCartesian')

p <- p %>% 
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Example

Earlier you saw how to make heatmaps (#TODO link to heatmaps), and they are just as easy to make interactive.

## `summarise()` has grouped output by 'location_name'. You can override using the `.groups` argument.
## Joining, by = c("location_name", "week")
metrics_plot %>% 
  ggplotly() %>% 
  partial_bundle() %>% 
  config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Maps - preparation

You can even make interactive maps! However, they’re slightly trickier. Although {plotly} works well with ggplot2::geom_sf in RStudio, when you try to include its outputs in R Markdown HTML files (like this book), it doesn’t work well.

So instead you can use {plotly}’s own mapping tools, which can be tricky but are easy once you know how. Read on…

We’re going to use Covid-19 incidence across African countries for this example. The data used can be found on the World Health Organisation website.

You’ll also need a new type of file, a GeoJSON, which is sort of similar to a shapefile (.shp) for those familiar with GIS. For this book, we used one from here.

GeoJSON files are stored in R as complex lists and you’ll need to manipulate them a little.

## You need two new packages: {rjson} and {purrr}
pacman::p_load(plotly, rjson, purrr)

## This is a simplified version of the WHO data
df <- rio::import(here::here("data", "covid_incidence.csv"))

## Load your geojson file
geoJSON <- rjson::fromJSON(file=here::here("data", "africa_countries.geo.json"))

## Here are some of the properties for each element of the object
head(geoJSON$features[[1]]$properties)
## $scalerank
## [1] 1
## 
## $featurecla
## [1] "Admin-0 country"
## 
## $labelrank
## [1] 6
## 
## $sovereignt
## [1] "Burundi"
## 
## $sov_a3
## [1] "BDI"
## 
## $adm0_dif
## [1] 0

This is the tricky part. For {plotly} to match your incidence data to GeoJSON, the countries in the geoJSON need an id in a specific place in the list of lists. For this we need to build a basic function:

## The property we need to choose here is "sovereignt", as it contains the name of each country
give_id <- function(x){
  
  x$id <- x$properties$sovereignt  ## Take sovereignt from properties and set it as the id
  
  return(x)
}

## Use {purrr} to apply this function to every element of the features list of the geoJSON object
geoJSON$features <- purrr::map(.x = geoJSON$features, give_id)

Maps - plot

plotly::plot_ly() %>% 
  plotly::add_trace(                    # The main plot mapping function
    type="choropleth",
    geojson=geoJSON,
    locations=df$Name,          #The column with the names (must match id)
    z=df$Cumulative_incidence,  #The column with the incidence values
    zmin=0,
    zmax=57008,
    colorscale="Viridis",
    marker=list(line=list(width=0))
  ) %>%
  plotly::colorbar(title = "Cases per million") %>%
  plotly::layout(title = "Covid-19 cumulative incidence",
                 geo = list(scope = 'africa')) %>% 
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

Resources

Plotly is not just for R; it also works well with Python (and really any data science language, as it’s built on JavaScript). You can read more about it on the plotly website.

VI Advanced

Advanced RStudio

Overview

rprofiles

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Relational databases

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
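For example, a tagged tabset heading in R Markdown could look like this (the tag name `rel_db` is an illustrative placeholder):

```markdown
# Relational databases {#rel_db .tabset .tabset-fade}
```

Elsewhere in the handbook, an internal link can then point to this page using the tag, e.g. `[see this page](#rel_db)`.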

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Routine reports

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

R Markdown

R Markdown is a fantastic tool for creating automated, reproducible, and share-worthy outputs. It can generate static or interactive outputs in HTML, Word, PDF, PowerPoint, and other formats.

Overview

Using R Markdown allows you to easily recreate an entire formatted document, including tables/figures/text, using new data (e.g. daily surveillance reports) and/or subsets of data (e.g. reports for specific geographies).

This guide will go through the basics. See ‘resources’ tab for further info.

Preparation

Preparation of an R Markdown workflow involves ensuring you have set up an R project and have a folder structure that suits the desired workflow.

For instance, you may want an ‘output’ folder for your rendered documents, an ‘input’ folder for new cleaned data files, as well as subfolders within them which are date-stamped or reflect the subgeographies of interest. The markdown itself can go in a ‘rmd’ subfolder, particularly if you have multiple Rmd files within the same project.

You can set code up to create output subfolders for you each time you run reports (see “Producing an output”), but you should have the overall design in mind.

Because R Markdown can run into pandoc issues when running on a shared network drive, it is recommended that your folder is on your local machine, e.g. in a project within ‘My Documents’. If you use Git (much recommended!), this will be familiar.

The R Markdown file

An R Markdown document looks like, and can be edited just like, a standard R script in RStudio. However, it contains more than just the usual R code and hashed comments. There are three basic components:

1. Metadata: This is referred to as the ‘YAML metadata’ and sits at the top of the R Markdown document between two lines of three dashes (---). It tells your Rmd file what type of output to produce, formatting preferences, and other metadata such as document title, author, and date. There are other uses not mentioned here (but referred to in ‘Producing an output’). Note that indentation matters.

2. Text: This is the narrative of your document, including the titles. It is written in the markdown language, used across many different programmes. This means you can add basic formatting, for instance:

  • _text_ or *text* to italicise
  • **text** for bold text
  • # at the start of a new line for a title (## for a second-level title, ### for a third-level title, etc.)
  • * at the start of a new line for bullet points
  • `text` (within backticks) to display text as code

The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).

You can also include minimal R code within backticks, for within-text values. See example below.

3. Code chunks: This is where the R code goes, for the actual data management and visualisation. To note: These ‘chunks’ will appear to have a slightly different background colour from the narrative part of the document.

Each chunk always starts with three backticks and chunk information within squiggly brackets, and ends with three more backticks.

Some notes about the content of the squiggly brackets:

  • They start with ‘r’ to indicate that the language of the chunk is R
  • Followed by the chunk name - note this should ALWAYS be a unique name, or else R will complain when you try to render.
  • They can include other options too, but many of these can be configured with point-and-click using the settings button at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. These choices are written as options within the squiggly brackets, e.g. ‘echo=FALSE’ if you specify you want to ‘Show output only’.

There are also two arrows at the top right of each chunk, which are useful to run code within a chunk, or all code in prior chunks.
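Putting the three components together, a minimal R Markdown file might look like the sketch below (the title, chunk name, and plotted dataset are illustrative placeholders; `cars` is a built-in R dataset):

````markdown
---
title: "Example report"
author: "Author name"
output: html_document
---

# Results

This sentence displays an in-text value: the data has `r nrow(cars)` rows.

```{r example_plot, echo=FALSE}
# A chunk named "example_plot"; echo=FALSE shows the output only, not the code
plot(cars)
```
````

The YAML metadata sits between the two `---` lines, the narrative text uses markdown formatting, and the chunk between the triple backticks runs R code.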

Producing an output

General notes

Everything used by this markdown must be referenced within the Rmd file. For instance, you need to load any required packages or data.

A single or test run from within R Markdown

To render a single document, for instance if you are testing it or if you only need to produce one rendered document at a time, you can do it from within the open R Markdown file. Click the “knit” button at the top of the document.

The ‘R Markdown’ tab will start processing to show you the overall progress, and the rendered document will automatically open when complete. This document will also be saved in the same folder as your markdown, with the same file name aside from the file extension. This is obviously not ideal for version control, as you will then have to rename the file yourself.

A single run from a separate script

To run the markdown so that a date-stamped file is produced, you can create a separate script and call the Rmd file from within it. You can also specify the folder and file name, and include a dynamic date and time, so that file will be date stamped on production.

rmarkdown::render("rmd_reports/create_RED_report.Rmd",
                        output_file = paste0("outputs/Report_", Sys.Date(), ".docx")) # Use 'paste0' to combine text and code for a dynamic file name

Routine runs into newly created date-stamped sub folders

Add a couple of lines of code to define the date you are running the report (e.g. using Sys.Date() as in the example above) and create new sub-folders. If you want the date to reflect a specific date rather than the current date, you can also enter it as an object.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

# Render the report
rmarkdown::render("rmd_reports/create_report.Rmd",
                        output_file = paste0(outputfolder, "/Report_", refdate, ".docx")) # Dynamic folder name now included

You may want some dynamic information to be included in the markdown itself. This is addressed in the next section.

Parameterised reports

Parameterised reports are the next step so that the content of the R Markdown itself can also be dynamic. For example, the title can change according to the subgeography you are running, and the data can filter to that subgeography of interest.

Let’s say you want to run the markdown to produce a report with surveillance data for Area1 and Area2. You will:

  1. Edit your R Markdown:
    1. Change your YAML metadata to include a ‘params’ section, which specifies the dynamic object.
    2. Refer to this parameterised object within the code as needed, e.g. filter(area == params$areanumber) rather than filter(area == "Area1").

For instance (simplified version which does not include setup code such as library/data loading):
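As a sketch of what that YAML edit could look like (the parameter names and default values here are illustrative, matching those used in the rendering loop later on this page):

```
---
title: "Report for `r params$areanumber`"
output: word_document
params:
  areanumber: "Area1"
  refdate: "2020-12-21"
---
```

Within the document, params$areanumber and params$refdate can then be used anywhere R code runs; values passed via rmarkdown::render(params = ...) override these defaults.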

You can change the content by editing the YAML as needed, or set up a loop in a separate script to iterate through the areas. As with the previous section, you can set up the folders as well.

As you can see below, you set up a list which includes all areas of interest (arealist), and when rendering the markdown you specify that the parameterised areanumber for a specific iteration is the nth value of the arealist. For instance, for the first iteration, areanumber will equate to “Area1”. The code below also specifies that the nth area name will be included in the output file name.

Note that this will work even if an area or date are specified within the YAML itself - that YAML information will get overwritten by the loop.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Set the list (note that this can also be an imported list)
arealist <- c("Area1", "Area2", "Area3", "Area4", "Area5")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

# Run the loop

for (i in 1:length(arealist)) { # This will loop through from the first value to the last value in 'arealist'

  rmarkdown::render(here("rmd_reports/create_report.Rmd"),
                    params = list(areanumber = arealist[i], # Assigns the ith value of arealist to the current areanumber
                                  refdate = refdate),
                    output_file = paste0(outputfolder, "/Report_", arealist[i], "_", refdate, ".docx"))

}

Shiny basics

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

flexdashboard

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Collaboration

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Writing functions

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable, e.g. using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

R on network drives

Overview

Using R on network or “company” shared drives can be extremely frustrating. This page contains approaches, common errors, and suggestions on troubleshooting, including for the particularly delicate situations involving Rmarkdown.

Using R on Network Drives: Overarching principles

  1. You must have administrator access on your computer. Set up RStudio specifically to run as administrator.
  2. Use your network ("\\") package library as little as possible; save packages to a "C:" library when possible.
  3. The rmarkdown package must not be in a network ("\\") library, as then it can’t talk to TinyTeX or Pandoc.

Preparation


Useful commands

# Find libraries
.libPaths()                   # Your library paths, listed in order that R installs/searches. 
                              # Note: all libraries will be listed, but to install to some (e.g. C:) you 
                              # may need to be running RStudio as an administrator (it won't appear in the 
                              # install packages library drop-down menu) 

# Switch order of libraries
# this can affect the priority of where R finds a package, e.g. you may want your C: library to be listed first
myPaths <- .libPaths() # get the paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them

# Find Pandoc
Sys.getenv("RSTUDIO_PANDOC")  # Find where RStudio thinks your Pandoc installation is

# Find a package
# gives first location of package (note order of your libraries)
find.package("rmarkdown", lib.loc = NULL, quiet = FALSE, verbose = getOption("verbose")) 

Troubleshooting common errors

“Failed to compile…tex in rmarkdown”

check/install tinytex, to C: location

# check/install tinytex, to C: location
tinytex::install_tinytex()
tinytex:::is_tinytex() # should return TRUE (note three colons)

Internet routines cannot be loaded

For example, “Error in tools::startDynamicHelp() : internet routines cannot be loaded”

  • Try selecting the 32-bit version of R from RStudio via Tools/Global Options.
    • Note: if the 32-bit version does not appear in the menu, make sure you are not using RStudio v1.2.
  • Or try uninstalling R and re-installing a different bit version (32 instead of 64)

C: library does not appear as an option when I try to install packages manually

  • You must run RStudio as an administrator; then it will appear.
  • To set up RStudio to always run as administrator (advantageous when using an R project, where you don’t click the RStudio icon to open)… right-click the RStudio icon, open Properties > Compatibility, and tick the checkbox Run as Administrator.

Pandoc 1 error

If you are getting Pandoc error 1 when knitting R Markdown files on network drives, try re-ordering your library paths:

myPaths <- .libPaths() # get the library paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them

Pandoc Error 83 (can’t find file…rmarkdown…lua…)

This means that Pandoc was unable to find a lua file belonging to the rmarkdown package.

See https://stackoverflow.com/questions/58830927/rmarkdown-unable-to-locate-lua-filter-when-knitting-to-word

Possibilities:

  1. The rmarkdown package is not installed
  2. The rmarkdown package is not findable
  3. An administrator rights issue

R is not able to find the rmarkdown package file, so check which library the rmarkdown package lives in. If it is in a library that is inaccessible (e.g. its path starts with "\\"), consider manually moving the package to the C: or another named drive library.
But be aware that the rmarkdown package has to be able to reach TinyTeX, so it cannot live on a network drive.
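A minimal sketch of checking where rmarkdown lives and re-installing it into a named-drive library (the library path below is a hypothetical example):

```r
# Where does R currently find rmarkdown? (first match across your libraries)
find.package("rmarkdown")

# List your library paths; one beginning with "\\" is a network location
.libPaths()

# Re-install rmarkdown into a specific local library (hypothetical path)
install.packages("rmarkdown", lib = "C:/R/library")
```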

Pandoc Error 61

For example: “Error: pandoc document conversion failed with error 61”

“Could not fetch…”

  • Try running RStudio as administrator (right-click the icon, select Run as administrator; see above instructions)
  • Also see if the specific package that could not be reached can be moved to the C: library.
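One way to move a package between libraries is to copy its whole directory; a minimal sketch, where both library paths are hypothetical examples:

```r
# Copy an installed package from a network library to a local one
# (both paths are hypothetical examples)
from_lib <- "\\\\server/share/R/library"
to_lib   <- "C:/R/library"

file.copy(
  from      = file.path(from_lib, "rmarkdown"),  # the package's directory
  to        = to_lib,                            # destination library
  recursive = TRUE                               # copy the whole directory tree
)
```

After copying, confirm with find.package("rmarkdown") that R now finds the package in the intended library.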

LaTeX error (see below)

“! Package pdftex.def Error: File `cict_qm2_2020-06-29_files/figure-latex/unnamed-chunk-5-1.png’ not found: using draft setting.”

“Error: LaTeX failed to compile file_name.tex.”
See https://yihui.org/tinytex/r/#debugging for debugging tips. See file_name.log for more info.

Pandoc Error 127

This could be a RAM (memory) issue. Restart your R session and try again.

Mapping network drives

How does one open a file “through a mapped network drive”?

  • First, you’ll need to know the network location you’re trying to access.
  • Next, in the Windows file manager, right-click on “This PC” and select “Map a network drive”.
  • Go through the dialogue to define the network location from earlier as a lettered drive.
  • Now you have two ways to get to the file you’re opening. Using the drive-letter path should work.

From: https://stackoverflow.com/questions/48161177/r-markdown-openbinaryfile-does-not-exist-no-such-file-or-directory/55616529?noredirect=1#comment97966859_55616529
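The same mapping can also be done from a Windows Command Prompt; a minimal sketch, where the server and share names are hypothetical:

```shell
:: Map the network location \\server\share to drive letter Z:
:: (/persistent:yes re-creates the mapping at each login)
net use Z: \\server\share /persistent:yes

:: Then reference files via the lettered drive in R, e.g. Z:\project\data.xlsx
```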

Issues with having a shared library location on a network drive

Error in install.packages()

Try removing the 00LOCK directory (…/…/00LOCK):

  • Manually delete the 00LOCK directory from your package library. Try installing again.
  • You can also try the command pacman::p_unlock() (you can put this command in your Rprofile so it runs every time the project opens).
  • Then try installing the package again. It may take several tries.
  • If all else fails, install the package to another library and then manually copy it over.
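Deleting the leftover 00LOCK directory can also be scripted; a minimal sketch, where the library path is a hypothetical example:

```r
# Remove a leftover 00LOCK directory from the library that failed to install
# (the library path below is a hypothetical example)
lock_dir <- file.path("C:/R/library", "00LOCK")

unlink(lock_dir, recursive = TRUE)   # deletes the directory if it exists; silent otherwise

# Alternatively, if you use the pacman package:
# pacman::p_unlock()
```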

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Directory interactions

Overview

Saving files, deleting files, creating folders, interacting with files in a folder, etc. Overwriting files in Excel.

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading the dataset
  • Adding or changing variables
  • Melting, pivoting, grouping, etc.
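As an illustration of what such a Preparation tab might contain, a minimal sketch in the tidyverse style used throughout this handbook (the linelist data and its columns are hypothetical):

```r
pacman::p_load(dplyr, tidyr)   # load (and install if missing) packages

# A small hypothetical linelist for illustration
linelist <- data.frame(
  age           = c("10", "20"),
  symptom_fever = c(1, 0),
  symptom_cough = c(0, 1)
)

# Add/change variables, then pivot the symptom columns to long format
linelist_long <- linelist %>%
  mutate(age_years = as.numeric(age)) %>%          # new numeric variable
  pivot_longer(
    cols      = starts_with("symptom_"),           # columns to pivot
    names_to  = "symptom",
    values_to = "present"
  )
```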

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed.

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be re-named. It should demonstrate execution of the task using the recommended package/approach: for example, a package customized for this task, where execution is simple and fast but perhaps less customizable, such as using the incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. It should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output, or more package stability. For example, showing how to create an epicurve using ggplot2.
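As an illustration of the ggplot2 approach mentioned above, a minimal epicurve sketch (the linelist and its date_onset column are hypothetical):

```r
pacman::p_load(ggplot2)   # load (and install if missing) ggplot2

# Hypothetical linelist of onset dates
set.seed(1)
linelist <- data.frame(
  date_onset = as.Date("2020-03-01") + sample(0:60, 100, replace = TRUE)
)

# Epicurve: histogram of cases by week of onset
epicurve <- ggplot(linelist, aes(x = date_onset)) +
  geom_histogram(binwidth = 7) +          # 7-day (weekly) bins
  labs(x = "Date of onset", y = "Cases", title = "Epidemic curve")

epicurve
```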

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.